Abstract
Neural representations estimated from functional MRI (fMRI) responses to natural sounds in non-primary auditory cortical areas resemble those in intermediate layers of deep neural networks (DNNs) trained to recognize sounds. However, the nature of these representations remains poorly understood. In the current study, a convolutional DNN (YAMNet), pre-trained to map sound spectrograms to semantic categories, is used as a computer simulation of the human brain’s processing of natural sounds. A novel sound dataset is introduced and employed to test the hypothesis that sound-to-event DNNs represent basic mechanisms of sound generation (here, human actions) and physical properties of the sources (here, object materials) in their intermediate layers. These latent representations are then systematically manipulated with the help of a disentangling flow model, and the manipulations are shown to have a predictable effect on the DNN’s semantic output. By demonstrating this mechanism in silico, the current study paves the way for neuroscientific experiments aiming to verify it in vivo. Code is available at https://github.com/TimHenry1995/LatentAudio.
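For orientation, the minimal sketch below shows how the publicly released YAMNet model maps raw audio to semantic category scores via TensorFlow Hub. It is an illustrative assumption of the basic inference step only; the intermediate-layer extraction and flow-based latent manipulations described above are implemented in the linked repository, not in this snippet.

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained YAMNet model (maps a mono 16 kHz waveform to 521 AudioSet class scores).
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

# One second of placeholder audio (silence); replace with a real recording for meaningful output.
waveform = np.zeros(16000, dtype=np.float32)

# YAMNet returns per-frame class scores, 1024-d embeddings, and the log-mel spectrogram it computed.
scores, embeddings, log_mel_spectrogram = yamnet(waveform)

# Average the scores over time frames and look up the most likely semantic category.
class_map_path = yamnet.class_map_path().numpy().decode('utf-8')
with tf.io.gfile.GFile(class_map_path) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]
mean_scores = scores.numpy().mean(axis=0)
print('Predicted category:', class_names[mean_scores.argmax()])
```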


