The Computational Magic of the Ventral Stream: Towards a Theory

Poggio, T. (sections with J. Mutch, J.Z. Leibo and L. Rosasco)

Nature Precedings, doi:10.1038/npre.2011.6117.1, July 16, 2011

Abstract: I argue that the sample complexity of (biological, feedforward) object recognition is mostly due to geometric image transformations, and conjecture that a main goal of the feedforward path in the ventral stream – from V1 through V2 and V4 to IT – is to learn-and-discount image transformations.

In the first part of the paper I describe a class of simple and biologically plausible memory-based modules that learn transformations from unsupervised visual experience. The main theorems show that these modules provide (for every object) a signature which is invariant to local affine transformations and approximately invariant to other transformations. I also prove that, in a broad class of hierarchical architectures, signatures remain invariant from layer to layer. The identification of these memory-based modules with complex (and simple) cells in visual areas leads to a theory of invariant recognition for the ventral stream.
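As a concrete illustration of such a module, here is a minimal Python sketch, not the construction used in the paper: the function names are invented for this example, the transformations are 1-D circular translations, and max pooling is only one of several pooling functions compatible with the theory. The module stores the orbit of a single template under observed transformations and pools dot products over that orbit; the resulting signature component is unchanged when a never-seen image is translated.

    import numpy as np

    rng = np.random.default_rng(0)

    def shifts(n):
        # The transformations observed during unsupervised visual
        # experience: here, all circular translations of an n-pixel
        # 1-D "image".
        return [lambda x, k=k: np.roll(x, k) for k in range(n)]

    def templatebook(template, transformations):
        # Memory-based storage of the orbit of one template under the
        # observed transformations, with each stored frame normalized.
        book = np.stack([t(template) for t in transformations])
        return book / np.linalg.norm(book, axis=1, keepdims=True)

    def signature_component(image, book):
        # Simple-cell-like stage: dot products with the stored frames.
        # Complex-cell-like stage: pool over the orbit (max, as in
        # HMAX), giving one component invariant to the stored group.
        x = image / np.linalg.norm(image)
        return np.max(book @ x)

    n = 32
    template = rng.standard_normal(n)            # one stored template
    image = rng.standard_normal(n)               # a new, never-seen object
    book = templatebook(template, shifts(n))

    s0 = signature_component(image, book)
    s1 = signature_component(np.roll(image, 7), book)  # translated view
    print(np.isclose(s0, s1))                    # True: invariant signature

The point the theorems make precise is visible even in this toy: invariance for the new image comes for free from transformations stored for an unrelated template.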

In the second part, I outline a theory of hierarchical architectures that can learn invariance to transformations. I show that the memory complexity of learning affine transformations is drastically reduced in a hierarchical architecture that factorizes transformations in terms of the subgroup of translations and the subgroups of rotations and scalings. I then show how translations may be automatically selected as the only learnable transformations during development by enforcing small apertures – e.g., small receptive fields – in the first layer.
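A back-of-the-envelope count conveys the flavor of the reduction (the discretization sizes below are illustrative assumptions, not numbers from the paper): if a single nonhierarchical module had to store one frame per discretized affine transformation, the cost would be the product of the subgroup sizes; factorizing across layers makes it roughly a sum.

    # Illustrative memory count, assuming one stored frame per
    # discretized transformation (sizes are made up for the example).
    n_trans = 64 * 64     # translations across a 64 x 64 visual field
    n_rot   = 16          # discretized in-plane rotations
    n_scale = 8           # discretized scalings

    flat = n_trans * n_rot * n_scale      # one module, full affine orbit
    factored = n_trans + n_rot * n_scale  # layer 1 discounts translations
                                          # (small apertures); later layers
                                          # handle rotation and scale
    print(flat, factored)                 # 524288 vs 4224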

In the third part I show that the transformations represented in each area can be optimized in terms of storage and robustness, thereby determining the tuning of the neurons in the area, largely independently (under normal conditions) of the statistics of natural images. I describe a model of learning that can be proved to have this property, linking in an elegant way the spectral properties of the signatures with the tuning of receptive fields in different areas.
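The flavor of this spectral link can be seen already for the translation subgroup (a minimal sketch under the assumption of 1-D circular translations; variable names are illustrative): the covariance of a templatebook of translated frames is circulant, so its eigenvectors are Fourier modes whatever the template content, i.e., the symmetry rather than the image statistics fixes the tuning.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 64
    template = rng.standard_normal(n)

    # Templatebook for the translation subgroup: all circular shifts.
    book = np.stack([np.roll(template, k) for k in range(n)])

    # Its (uncentered) covariance is circulant: entry (i, j) depends
    # only on (i - j) mod n, so its eigenvectors are Fourier modes
    # regardless of the template's detailed content.
    cov = book.T @ book / n
    eigvals, eigvecs = np.linalg.eigh(cov)

    # The top eigenvector is (numerically) a pure sinusoid: its power
    # spectrum concentrates on a single frequency.
    spectrum = np.abs(np.fft.rfft(eigvecs[:, -1])) ** 2
    print(spectrum.argmax(), spectrum.max() / spectrum.sum())  # ratio near 1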

A surprising implication of these theoretical results is that the computational goals and some of the tuning properties of cells in the ventral stream may follow from symmetry properties (in the sense of physics) of the visual world through a process of unsupervised correlational learning, based on Hebbian synapses. In particular, simple and complex cells do not directly care about oriented bars: their tuning is a side effect of their role in translation invariance. Across the whole ventral stream the preferred features reported for neurons in different areas are only a symptom of the invariances computed and represented.
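A minimal sketch of the correlational-learning claim (using Oja's rule as a stand-in for a generic stabilized Hebbian update; the learning rate and step count are arbitrary choices here): a single Hebbian unit exposed to randomly translated views of one template converges toward the top eigenvector of the circulant covariance above, i.e., toward sinusoidal, grating-like tuning, without oriented bars ever being a target.

    import numpy as np

    rng = np.random.default_rng(2)
    n, eta, steps = 64, 0.01, 20000
    template = rng.standard_normal(n)
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)

    for _ in range(steps):
        x = np.roll(template, rng.integers(n))  # a randomly translated view
        y = w @ x                               # pre- and post-synaptic activity
        w += eta * y * (x - y * w)              # Oja's stabilized Hebbian rule

    # The learned weights approach a single-frequency sinusoid; the
    # ratio below approaches 1 (more steps may be needed for some
    # random templates with a small spectral gap).
    spectrum = np.abs(np.fft.rfft(w)) ** 2
    print(spectrum.argmax(), spectrum.max() / spectrum.sum())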

The results of each of the three parts stand on their own, independently of each other. Together, this theory-in-fieri makes several broad predictions.

The theory is broadly consistent with the current version of HMAX: it explains HMAX and extends it in terms of unsupervised learning, a broader class of transformation invariances, and higher-level modules. The goal of this paper is to sketch a comprehensive theory with little regard for mathematical niceties. If the theory turns out to be useful, there will be scope for deep mathematics, ranging from group representation tools to wavelet theory to the dynamics of learning.
