Movie reconstruction from mouse visual cortex activity
The number of publications on image reconstruction, primarily from fMRI data, has grown rapidly in recent years, and a comprehensive review of all approaches is outside the scope of this paper. However, we briefly summarize the most common approaches and how they relate to our own method. In general, image reconstruction methods can be categorized into one of four groups: direct decoding models, encoder-decoder models, invertible encoding models, and encoder model input optimization.
Direct decoders map neuronal activity back to the input images/videos with deep neural networks (Shen et al., 2019a; Zhang et al., 2020; Li et al., 2023). When training direct decoders, the decoder can be pretrained (Ren et al., 2021), or additional constraints can be added to the loss function to encourage the decoder to produce images that adhere to learned image statistics (Shen et al., 2019a; Kupershmidt et al., 2022). A direct decoder approach has been used for video reconstruction in mice (Chen et al., 2024), but in that case the training and test movies were the same, so it is unclear whether out-of-training-set generalization was achieved (a key distinction between sensory reconstruction and stimulus identification; see previous section).
In encoder-decoder models, the aim is to combine separately trained brain encoders (brain activity to latent space) with decoders (latent space to image/video). This approach has recently become particularly popular because it allows the use of SOTA generative image models, such as Stable Diffusion (Rombach et al., 2021; Takagi and Nishimoto, 2023; Scotti et al., 2023; Chen et al., 2023; Benchetrit et al., 2023). The encoder part of the model is first trained to translate brain activity into a latent space that the pretrained generative network can interpret. Because these latent spaces are often conditioned on semantic information, this lends itself to separate processing of low-level visual and high-level semantic information from brain activity (Scotti et al., 2023).
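To make the division of labor concrete, the pipeline above can be sketched with a toy linear model. None of this reflects the cited architectures: the "decoder" here is a fixed random linear map standing in for a pretrained generative model, the brain activity is simulated by a hypothetical linear mixing, and only the encoder half (activity to latent) is fit, via ridge regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n_voxels, n_latent, n_pixels = 200, 16, 64  # toy sizes, not from any dataset

# Frozen stand-in "decoder": latent code -> image. In the cited work this
# is a pretrained generative model; a fixed linear map plays that role here.
decoder = rng.normal(size=(n_pixels, n_latent))

# Simulated training pairs: latent codes and the brain activity they evoke
# (a hypothetical noiseless linear mixing; real recordings would replace this).
mixing = rng.normal(size=(n_latent, n_voxels))
train_latents = rng.normal(size=(500, n_latent))
train_activity = train_latents @ mixing

# Train only the encoder half: ridge regression from activity to latents.
lam = 1e-3
A = train_activity
encoder = np.linalg.solve(A.T @ A + lam * np.eye(n_voxels), A.T @ train_latents)

# Reconstruction pipeline: held-out activity -> predicted latent -> image.
test_latent = rng.normal(size=n_latent)
test_activity = test_latent @ mixing
image = decoder @ (test_activity @ encoder)
```

The key property the sketch illustrates is that the decoder's weights never enter the training objective: swapping in a different pretrained generative model only requires retraining the mapping from brain activity into its latent space.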
Invertible encoding models are encoding models which, once trained to predict neuronal activity, can be implicitly inverted to predict the sensory input given brain activity. We also include in this class models that first compute the receptive field or preferred stimulus of each neuron (or voxel) and reconstruct the input as the sum of these receptive fields weighted by the corresponding activity (Stanley et al., 1999; Thirion et al., 2006; Garasto et al., 2019; Brackbill et al., 2020; Yoshida and Ohki, 2020; Nishimoto et al., 2011). The downside of this approach is that invertible linear models generally capture the coding properties of neurons less well than more complex deep neural networks (Willeke et al., 2023).
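As a toy illustration of the weighted-sum idea (not a reimplementation of any cited method), consider a noiseless linear population with random receptive fields. All sizes and names below are invented; the sketch contrasts the raw activity-weighted sum of receptive fields with an explicit linear inversion that corrects for overlap between filters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_pixels = 200, 64  # toy sizes

# Hypothetical linear receptive fields: one spatial filter per neuron.
rfs = rng.normal(size=(n_neurons, n_pixels))

# Responses of a noiseless linear population to an unseen stimulus.
stimulus = rng.normal(size=n_pixels)
responses = rfs @ stimulus

# Naive reconstruction: sum of receptive fields weighted by activity.
# This recovers the stimulus only approximately, because the random
# filters are correlated with one another.
weighted_sum = rfs.T @ responses / n_neurons

# Explicit linear inversion (pseudoinverse) corrects for filter overlap;
# with more neurons than pixels and no noise it recovers the stimulus.
reconstruction = np.linalg.pinv(rfs) @ responses
```

With realistic noise and nonlinearities the inversion is no longer exact, which is where the performance gap to deep encoding models noted above becomes relevant.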
Encoder input optimization also involves first training an encoder which predicts the activity of neurons or voxels given sensory input. Once trained, the encoder is fixed, and the input to the network is optimized using backpropagation until the predicted activity matches the observed activity (Pierzchlewicz et al., 2023). Unlike with invertible encoding models, any SOTA neuronal encoding model can be used. But like invertible models, the networks are not specifically trained to reconstruct images, so they may be less likely to extrapolate beyond the information encoded by the brain by exploiting learned general image statistics. There is some evidence to support this: static image reconstructions that were optimized to evoke similar in silico predicted neural activity also evoked more similar neural responses in vivo than reconstructions from methods that optimized image similarity directly (Cobos et al., 2022).
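The optimization loop can be sketched in a few lines, again with a toy linear encoder rather than the deep video models used in the cited work; the structure of the loop (frozen encoder, gradient descent on the input, loss defined on predicted versus observed activity) is the same regardless of the encoder's complexity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_pixels = 128, 64  # toy sizes

# Stand-in "encoder": a fixed linear model mapping stimulus -> activity.
# A deep network would replace this map; only its gradients are needed.
W = rng.normal(size=(n_neurons, n_pixels)) / np.sqrt(n_pixels)

true_stimulus = rng.normal(size=n_pixels)
observed = W @ true_stimulus          # "recorded" responses to be matched

# Encoder input optimization: keep W frozen and run gradient descent on
# the input until the predicted activity matches the observed activity.
x = np.zeros(n_pixels)
lr = 0.05
for _ in range(3000):
    residual = W @ x - observed       # prediction error in activity space
    x -= lr * 2.0 * W.T @ residual    # gradient of ||W @ x - observed||**2

final_loss = float(np.sum((W @ x - observed) ** 2))
```

Note that the loss never compares images directly: reconstruction quality is inherited entirely from how faithfully the encoder captures the neurons' coding properties, which is why advances in encoding models translate directly into better reconstructions under this approach.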
Although outlined here as four distinct classes, these approaches can be combined. For instance, encoder input optimization can be combined with image diffusion (Pierzchlewicz et al., 2023) and in principle, invertible models could also be combined in such a way.
We chose to pursue a pure encoder input optimization approach for single-cell mouse visual cortex activity for two reasons. First, there have been considerable advances in the performance of neuronal encoding models for dynamic visual stimuli (Sinz et al., 2018; Wang et al., 2025; Turishcheva et al., 2024), and we aimed to take advantage of these developments. Second, adding a generative decoder trained to produce high-quality images brings with it the risk of extrapolating information from general image statistics rather than interpreting what the brain is representing. In some cases the brain may not be encoding a coherent image at all, and in those cases, we would argue, image reconstruction should fail rather than produce an image when only semantic information is present.