A theory of cortical responses.

Friston K

doi:10.1098/rstb.2005.1622

Affiliations

1. The Wellcome Department of Imaging Neuroscience, Institute of Neurology, University College London, 12 Queen Square, London WC1N 3BG, UK.

ORCIDs linked to this article

Friston K | 0000-0001-7984-8909

Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 01 Apr 2005, 360(1456):815-836
https://doi.org/10.1098/rstb.2005.1622 PMID: 15937014 PMCID: PMC1569488

Abstract

This article concerns the nature of evoked brain responses and the principles underlying their generation. We start with the premise that the sensory brain has evolved to represent or infer the causes of changes in its sensory inputs. The problem of inference is well formulated in statistical terms. The statistical fundaments of inference may therefore afford important constraints on neuronal implementation. By formulating the original ideas of Helmholtz on perception, in terms of modern-day statistical theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts. It turns out that the problems of inferring the causes of sensory input (perceptual inference) and learning the relationship between input and cause (perceptual learning) can be resolved using exactly the same principle. Specifically, both inference and learning rest on minimizing the brain's free energy, as defined in statistical physics. Furthermore, inference and learning can proceed in a biologically plausible fashion. Cortical responses can be seen as the brain's attempt to minimize the free energy induced by a stimulus and thereby encode the most likely cause of that stimulus. Similarly, learning emerges from changes in synaptic efficacy that minimize the free energy, averaged over all stimuli encountered. The underlying scheme rests on empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organization and responses. The aim of this article is to encompass many apparently unrelated anatomical, physiological and psychophysical attributes of the brain within a single theoretical perspective.
In terms of cortical architectures, the theoretical treatment predicts that sensory cortex should be arranged hierarchically, that connections should be reciprocal and that forward and backward connections should show a functional asymmetry (forward connections are driving, whereas backward connections are both driving and modulatory). In terms of synaptic physiology, it predicts associative plasticity and, for dynamic models, spike-timing-dependent plasticity. In terms of electrophysiology, it accounts for classical and extra classical receptive field effects and long-latency or endogenous components of evoked cortical responses. It predicts the attenuation of responses encoding prediction error with perceptual learning and explains many phenomena such as repetition suppression, mismatch negativity (MMN) and the P300 in electroencephalography. In psychophysical terms, it accounts for the behavioural correlates of these physiological phenomena, for example, priming and global precedence. The final focus of this article is on perceptual learning as measured with the MMN and the implications for empirical studies of coupling among cortical areas using evoked sensory responses.

Philos Trans R Soc Lond B Biol Sci. 2005 Apr 29; 360(1456): 815–836.

Published online 2005 Apr 29. https://doi.org/10.1098/rstb.2005.1622

Karl Friston^*

Keywords: cortical, inference, predictive coding, generative models, Bayesian, hierarchical

1. Introduction

This article represents an attempt to understand evoked cortical responses in terms of models of perceptual inference and learning. The specific model considered here rests on empirical Bayes, in the context of generative models that are embodied in cortical hierarchies. This model can be regarded as a mathematical formulation of the longstanding notion that ‘our minds should often change the idea of its sensation into that of its judgment, and make one serve only to excite the other’ (Locke 1690). In a similar vein, Helmholtz (1860) distinguishes between perception and sensation. ‘It may often be rather hard to say how much from perceptions as derived from the sense of sight is due directly to sensation, and how much of them, on the other hand, is due to experience and training’ (see Pollen 1999). In short, there is a distinction between percepts, which are the products of recognizing the causes of sensory input and sensation per se. Recognition (i.e. inferring causes from sensation) is the inverse of generating sensory data from their causes. It follows that recognition rests on models, learned through experience, of how sensations are caused. In this article, we will consider hierarchical generative models and how evoked cortical responses can be understood as part of the recognition process. The particular recognition scheme we will focus on is based on empirical Bayes, where prior expectations are abstracted from the sensory data, using a hierarchical model of how those data were caused. The particular implementation of empirical Bayes we consider is predictive coding, where prediction error is used to adjust the state of the generative model until prediction error is minimized and the most likely causes of sensory input have been identified.
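The logic of predictive coding can be illustrated with a minimal sketch, in which an estimate of a single cause is adjusted until the prediction error is negligible. The linear mapping, the value of `theta`, the step size and the function names are illustrative assumptions and are not taken from the scheme described in this article; in particular, prior expectations are omitted here.

```python
def predict(cause, theta):
    """Hypothetical generative mapping: the input expected for a given cause."""
    return theta * cause


def infer_cause(sensory_input, theta=2.0, rate=0.1, n_steps=100):
    """Adjust the estimated cause by descending the squared prediction error."""
    estimate = 0.0
    for _ in range(n_steps):
        error = sensory_input - predict(estimate, theta)  # prediction error
        estimate += rate * theta * error  # gradient step on 0.5 * error**2
    return estimate


estimate = infer_cause(3.0)
residual = 3.0 - predict(estimate, 2.0)  # remaining prediction error
```

Under these assumptions the estimate settles on the cause that explains the input (here, the input divided by `theta`) and the residual prediction error vanishes.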

Conceptually, empirical Bayes and generative models are related to ‘analysis-by-synthesis’ (Neisser 1967). This approach to perception, drawn from cognitive psychology, involves adapting an internal model of the world to match sensory input and was suggested by Mumford (1992) as a way of understanding hierarchical neuronal processing. The idea is reminiscent of MacKay's epistemological automata (MacKay 1956), which perceive by comparing expected and actual sensory input (Rao 1999). These models emphasize the role of backward connections in mediating predictions of lower level input, based on the activity of higher cortical levels.

Recognition is simply the process of solving an inverse problem by jointly minimizing prediction error at all levels of the cortical hierarchy. The main point of this article is that evoked cortical responses can be understood as transient expressions of prediction error, which index some recognition process. This perspective accommodates many physiological and behavioural phenomena, for example, extra classical RF effects and repetition suppression in unit recordings, the MMN and P300 in ERPs, priming and global precedence effects in psychophysics. Critically, many of these emerge from the same basic principles governing inference with hierarchical generative models.

In a series of previous papers (Friston 2002, 2003), we have described how the brain might use empirical Bayes for perceptual inference. These papers considered other approaches to representational learning as special cases of generative models, starting with supervised learning and ending with empirical Bayes. The latter predicts many architectural features, such as a hierarchical cortical system, prevalent top-down backward influences and functional asymmetries between forward and backward connections seen in the real brain. The focus of previous work was on functional cortical architectures. This paper looks at evoked responses and the relevant empirical findings, in relation to predictions and theoretical constraints afforded by the same theory. This is probably more relevant for experimental studies. We will therefore take a little time to describe recent advances in modelling evoked responses in human cortical systems to show the detailed levels of characterization it is now possible to attain.

(a) Overview

We start by reviewing two principles of brain organization, namely functional specialization and functional integration and how they rest upon the anatomy and physiology of hierarchical cortico-cortical connections. Representational inference and learning from a theoretical or computational perspective is discussed in §3. This section reviews the heuristics behind schemes using the framework of hierarchical generative models and introduces learning based on empirical Bayes that they enable. The key focus of this section is on the functional architectures implied by the model. Representational inference and learning can, in some cases, proceed using only forward connections. However, this is only tenable when processes generating sensory inputs are invertible and independent. Invertibility is precluded by nonlinear interactions among causes of sensory input (e.g. visual occlusion). These interactions create a problem for recognition that can be resolved using generative models. Generative or forward models solve the recognition problem using the a priori distribution of causes. Empirical Bayes allows these priors to be induced by sensory input, using hierarchies of backward and lateral projections that prevail in the real brain. In short, hierarchical models of representational learning are a natural choice for understanding real functional architectures and, critically, confer a necessary role on backward connections. Predictions and empirical findings that arise from the theoretical considerations are reviewed in §5–7. Implications for functional architectures, in terms of how connections are organized, functional asymmetries between forward and backward connections and how they change with learning, are highlighted in §3. Then, §4 moves from infrastructural issues to implications for physiological responses during perceptual inference. It focuses on extra classical RF effects and long-latency responses in electrophysiology. The final sections look at the effect of perceptual learning on evoked responses subtending inference, as indexed by responses to novel or deviant stimuli. We conclude with a demonstration of how plasticity, associated with perceptual learning, can be measured and used to test some key theoretical predictions.

2. Functional specialization and integration

(a) Background

The brain appears to adhere to two fundamental principles of functional organization, integration and specialization. The distinction relates to that between ‘localizationism’ and ‘(dis)connectionism’ that dominated thinking about cortical function in the nineteenth century. Since the early anatomic theories of Gall, the identification of a particular brain region with a specific function has become a central theme in neuroscience and was the motivation for Brodmann's cytoarchitectonic work (Brodmann 1905; see also Kötter & Wanke 2005). Brodmann posited ‘areae anatomicae’ to denote distinct cortical fields that could be recognized using anatomical techniques. His goal was to create a comparative system of organs that comprised these elemental areas, each with a specific function integrated within the system (Brodmann 1909).

Initially, functional localization per se was not easy to demonstrate. For example, a meeting that took place on 4 August 1881 addressed the difficulties of attributing function to a cortical area, given the dependence of cerebral activity on underlying connections (Phillips et al. 1984). This meeting was entitled Localization of function in the cortex cerebri. Although accepting the results of electrical stimulation in dog and monkey cortex, Goltz considered that the excitation method was inconclusive because the behaviours elicited might have originated in related pathways or current could have spread to distant centres. In short, the excitation method could not be used to infer functional localization because localizationism discounted interactions or functional integration among different brain areas. It was proposed that lesion studies could supplement excitation experiments. Ironically, it was observations on patients with brain lesions some years later (see Absher & Benson 1993) that led to the concept of ‘disconnection syndromes’ and the refutation of localizationism as a complete or sufficient explanation of cortical organization. The cortical infrastructure supporting a single function may involve many specialized areas whose union is mediated by functional integration. Functional specialization and integration are not exclusive; they are complementary. Functional specialization is only meaningful in the context of functional integration and vice versa.

(b) Functional specialization and segregation

The functional role, played by any component (e.g. cortical area, subarea, neuronal population or neuron) of the brain is defined largely by its connections. Clearly, this ‘connectivity’ may transcend many scales (e.g. molecular to social). However, here we focus on anatomical connections. Certain patterns of cortical projections are so common that they could amount to rules of cortical connectivity. ‘These rules revolve around one, apparently, overriding strategy that the cerebral cortex uses—that of functional segregation’ (Zeki 1990). Functional segregation demands that cells with common functional properties be grouped together. There are many examples of this grouping (e.g. laminar selectivity, ocular dominance bands and orientation domains in V1). This architectural constraint necessitates both convergence and divergence of cortical connections. Extrinsic connections, between cortical regions, are not continuous but occur in patches or clusters. In some instances, this patchiness has a clear relationship to functional segregation. For example, the secondary visual area V2 has a distinctive cytochrome oxidase architecture, consisting of thick stripes, thin stripes and inter-stripes. When recordings are made in V2, directionally selective (but not wavelength or colour‐selective) cells are found exclusively in the thick stripes. Retrograde (i.e. backward) labelling of cells in V5 is limited to these thick stripes. All the available physiological evidence suggests that V5 is a functionally homogeneous area that is specialized for visual motion. Evidence of this nature supports the idea that patchy connectivity is the anatomical infrastructure that underpins functional segregation and specialization.

(c) The anatomy and physiology of cortico-cortical connections

If specialization depends upon connections, then important organizational principles should be embodied in their anatomy and physiology. Extrinsic connections couple different cortical areas, whereas intrinsic connections are confined to the cortical sheet. There are certain features of cortico-cortical connections that provide strong clues about their functional role. In brief, there appears to be a hierarchical organization that rests upon the distinction between forward and backward connections (Maunsell & Van Essen 1983). The designation of a connection as forward or backward depends primarily on its cortical layers of origin and termination. The important characteristics of cortico-cortical connections are listed below. This list is not exhaustive but serves to introduce some principles that have emerged from empirical studies of visual cortex.

(i) Hierarchical organization

The organization of the visual cortices can be considered as a hierarchy of cortical levels with reciprocal cortico-cortical connections among the constituent cortical areas (Maunsell & Van Essen 1983; Felleman & Van Essen 1991). Forward connections run from lower to higher areas and backward connections from higher to lower. Lateral connections connect regions within a hierarchical level. The notion of a hierarchy depends upon a distinction between extrinsic forward and backward connections (see figure 1).

Figure 1

Schematic illustrating hierarchical structures in the brain and the distinction between forward, backward and lateral connections. This schematic is inspired by Mesulam's (1998) notion of sensory-fugal processing over ‘a core synaptic hierarchy, which includes the primary sensory, upstream unimodal, downstream unimodal, heteromodal, paralimbic and limbic zones of the cerebral cortex’ (see Mesulam 1998 for more details).

(ii) Reciprocal connections

Although reciprocal, forward and backward connections show a microstructural and functional asymmetry and the terminations of both show laminar specificity. Forward connections (from a low to a high level) have sparse axonal bifurcations and are topographically organized, originating in supragranular layers and terminating largely in layer 4. On the other hand, backward connections show abundant axonal bifurcation and a more diffuse topography, although they can be patchy (Angelucci et al. 2002a,b). Their origins are bilaminar/infragranular and they terminate predominantly in supragranular layers (Rockland & Pandya 1979; Salin & Bullier 1995). An important distinction is that backward connections are more divergent. For example, the divergence region of a point in V5 (i.e. the region receiving backward afferents from V5) may include thick and inter-stripes in V2, whereas its convergence region (i.e. the region providing forward afferents to V5) is limited to the thick stripes (Zeki & Shipp 1988). Furthermore, backward connections are more abundant. For example, the ratio of forward efferent connections to backward afferents in the lateral geniculate is about 1:10. Another distinction is that backward connections traverse a number of hierarchical levels whereas forward connections are more restricted. For example, there are backward connections from TE and TEO to V1 but no monosynaptic connections from V1 to TE or TEO (Salin & Bullier 1995).

(iii) Functionally asymmetric forward and backward connections

Functionally, reversible inactivation studies (e.g. Sandell & Schiller 1982; Girard & Bullier 1989) and neuroimaging (e.g. Büchel & Friston 1997) suggest that forward connections are driving and always elicit a response, whereas backward connections can be modulatory. In this context, modulatory means backward connections modulate responsiveness to other inputs. At the single cell level, ‘inputs from drivers can be differentiated from those of modulators. The driver can be identified as the transmitter of RF properties; the modulator can be identified as altering the probability of certain aspects of that transmission’ (Sherman & Guillery 1998).

The notion that forward connections are concerned with the promulgation and segregation of sensory information is consistent with: (i) their sparse axonal bifurcation; (ii) patchy axonal terminations; and (iii) topographic projections. In contrast, backward connections are considered to have a role in mediating contextual effects and in the co-ordination of processing channels. This is consistent with: (i) their frequent bifurcation; (ii) diffuse axonal terminations; and (iii) more divergent topography (Salin & Bullier 1995; Crick & Koch 1998). Forward connections mediate their post-synaptic effects through fast AMPA (1.3–2.4ms decay) and GABA_A (6ms decay) receptors. Modulatory effects can be mediated by NMDA receptors. NMDA receptors are voltage-sensitive, showing nonlinear and slow dynamics (approximately 50ms decay). They are found predominantly in supragranular layers where backward connections terminate (Salin & Bullier 1995). These slow time constants again point to a role in mediating contextual effects that are more enduring than phasic sensory-evoked responses. The clearest evidence for the modulatory role of backward connections (that is mediated by ‘slow’ glutamate receptors) comes from corticogeniculate connections. In the cat LGN, cortical feedback is partly mediated by type 1 metabotropic glutamate receptors, which are located exclusively on distal segments of the relay-cell dendrites. Rivadulla et al. (2002) have shown that these backward afferents enhance the excitatory centre of the thalamic RF. ‘Therefore, cortex, by closing this corticofugal loop, is able to increase the gain of its thalamic input within a focal spatial window, selecting key features of the incoming signal.’

Angelucci et al. (2002a,b) used a combination of anatomical and physiological recording methods to determine the spatial scale of intra-areal V1 horizontal connections and inter-areal backward connections to V1. ‘Contrary to common beliefs, these (monosynaptic horizontal) connections cannot fully account for the dimensions of the surround field (of macaque V1 neurons). The spatial scale of feedback circuits from extrastriate cortex to V1 is, instead, commensurate with the full spatial range of centre-surround interactions. Thus these connections could represent an anatomical substrate for contextual modulation and global-to-local integration of visual signals.’

It should be noted that the hierarchical ordering of areas is a matter of debate and may be indeterminate. Based on computational neuroanatomic studies Hilgetag et al. (2000) conclude that the laminar hierarchical constraints presently available in the anatomical literature are ‘insufficient to constrain a unique ordering’ for any of the sensory systems analysed. However, basic hierarchical principles were evident. Indeed, the authors note, ‘All the cortical systems we studied displayed a significant degree of hierarchical organization’ with the visual and somato-motor systems showing an organization that was ‘surprisingly strictly hierarchical’.

In the post-developmental period, synaptic plasticity is an important functional attribute of connections in the brain and is thought to subserve perceptual and procedural learning and memory. This is a large and fascinating field that ranges from molecules to maps (e.g. Buonomano & Merzenich 1998; Martin et al. 2000). Changing the strength of connections between neurons is widely assumed to be the mechanism by which memory traces are encoded and stored in the central nervous system. In its most general form, the synaptic plasticity and memory hypothesis states that, ‘Activity-dependent synaptic plasticity is induced at appropriate synapses during memory formation and is both necessary and sufficient for the information storage underlying the type of memory mediated by the brain area in which that plasticity is observed’ (see Martin et al. 2000 for an evaluation of this hypothesis). A key aspect of this plasticity is that it is generally associative.

(iv) Associative plasticity

Synaptic plasticity may be transient (e.g. short-term potentiation or depression) or enduring (e.g. long-term potentiation or depression) with many different time constants. In contrast to short-term plasticity, long-term changes rely on protein synthesis, synaptic remodelling and infrastructural changes in cell processes (e.g. terminal arbours or dendritic spines) that are mediated by calcium-dependent mechanisms. An important aspect of NMDA receptors, in the induction of long‐term potentiation, is that they confer associativity on changes in connection strength. This is because their voltage-sensitivity allows calcium ions to enter the cell when, and only when, there is conjoint pre-synaptic release of glutamate and sufficient post-synaptic depolarization (i.e. the temporal association of pre- and post-synaptic events). Calcium entry renders the post-synaptic specialization eligible for future potentiation by promoting the formation of synaptic ‘tags’ (e.g. Frey & Morris 1997) and other calcium-dependent intracellular mechanisms.
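The associativity conferred by NMDA receptors can be caricatured as a coincidence rule: a connection is strengthened only when pre-synaptic release and sufficient post-synaptic depolarization occur together. The following sketch is purely illustrative; the function name, the threshold and the learning rate are hypothetical choices and do not model receptor kinetics or calcium dynamics.

```python
def associative_update(weight, pre_release, post_depolarization,
                       threshold=0.5, rate=0.1):
    """Potentiate only on conjoint pre- and post-synaptic activity."""
    # The voltage-dependent gate opens only if the cell is depolarized enough.
    gate_open = post_depolarization >= threshold
    if pre_release and gate_open:
        return weight + rate  # conjunction of both events: potentiation
    return weight  # either event alone leaves the connection unchanged


both = associative_update(1.0, True, 0.8)
pre_only = associative_update(1.0, True, 0.2)
post_only = associative_update(1.0, False, 0.8)
```

Only the first case, in which both events coincide, changes the connection strength.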

In summary, the anatomy and physiology of cortico-cortical connections suggest that forward connections are driving and commit cells to a prespecified response given the appropriate pattern of inputs. Backward connections, on the other hand, are less topographic and are in a position to modulate the responses of lower areas. Modulatory effects imply the postsynaptic response evoked by presynaptic input is modulated by, or interacts in a nonlinear way with, another input. This interaction depends on nonlinear synaptic or dendritic mechanisms. Finally, brain connections are not static but are changing at the synaptic level all the time. In many instances, this plasticity is associative. In §3, we describe a theoretical perspective, provided by generative models, that highlights the functional importance of hierarchies, backward connections, nonlinear coupling and associative plasticity.

3. Representational inference and learning

This section introduces learning and inference based on empirical Bayes. A more detailed discussion can be found in Friston (2002, 2003). We will introduce the notion of generative models and a generic scheme for their estimation. This scheme uses expectation maximization (EM; an iterative scheme that estimates conditional expectations and maximum likelihoods of model parameters, in an E- and M-step, respectively). We show that predictive coding can be used to implement EM and, in the context of hierarchical generative models, is sufficient to implement empirical Bayesian inference.

(a) Causes and representations

Here, a representation is taken to be a neuronal response that represents some ‘cause’ in the sensorium. Causes are simply the states of processes generating sensory data. It is not easy to ascribe meaning to these states without appealing to the way that we categorize things, either perceptually or conceptually. Causes may be categorical in nature, such as the identity of a face or the semantic category to which an object belongs. Others may be parametric, such as the position of an object. Even though causes may be difficult to describe they are easy to define operationally. Causes are quantities or states that are necessary to specify the products of a process generating sensory information. For the sake of simplicity, let us frame the problem of representing causes in terms of a deterministic nonlinear function.

\[ u = g(v, \theta), \tag{3.1} \]

where v is a vector (i.e. a list) of underlying causes in the environment (e.g. the velocity of a particular object, direction of radiant light, etc.), and u represents sensory input; g(v,θ) is a function that generates inputs from the causes; θ represents the parameters of the generative model. Unlike the causes, the parameters are fixed quantities that have to be learned. We shall see later that the parameters correspond to connection strengths in the brain's model of how inputs are caused. Nonlinearities in equation (3.1) represent interactions among the causes. These can often be viewed as contextual effects, where the expression of a particular cause depends on the context established by another. A ubiquitous example from early visual processing is the occlusion of one object by another. In a linear world, the visual sensation caused by two objects would be a transparent overlay or superposition. Occlusion is a nonlinear phenomenon because the sensory input from one object (occluded) interacts with, or depends on, the other (occluder). This interaction is an example of nonlinear mixing of causes to produce sensory data. At a cognitive level, the cause associated with the word ‘hammer’ will depend on the semantic context (that determines whether the word is a verb or a noun).

The problem the brain has to contend with is to find a function of the inputs that recognizes the underlying causes. To do this, the brain must effectively undo the interactions to disclose contextually invariant causes. In other words, the brain must perform a nonlinear unmixing of causes and context. The key point here is that the nonlinear mixing may not be invertible and that the estimation of causes from input may be fundamentally ill-posed. For example, no amount of unmixing can discern the parts of an object that are occluded by another. The corresponding indeterminacy in probabilistic learning rests on the combinatorial explosion of ways in which stochastic generative models can generate input patterns (Dayan et al. 1995). In what follows, we consider the implications of this problem. Put simply, recognition of causes from sensory data is the inverse of generating data from causes. If the generative model is not invertible then recognition can only proceed if there is an explicit generative model in the brain. This speaks to the importance of backward connections that may embody this model.
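The difference between linear and nonlinear mixing, and the loss of invertibility under occlusion, can be made concrete with a toy example. The one-dimensional 'images' and function names below are assumptions introduced for illustration only. Two background objects that differ only behind the occluder produce identical input under occlusion, but distinct input under superposition.

```python
def superpose(front, back):
    """Linear world: the input is the pointwise sum of both objects."""
    return [f + b for f, b in zip(front, back)]


def occlude(front, back):
    """Nonlinear world: wherever the front object is present it hides the back."""
    return [f if f != 0 else b for f, b in zip(front, back)]


occluder = [0, 5, 5, 0]
back_a = [1, 2, 3, 4]
back_b = [1, 9, 9, 4]  # differs from back_a only behind the occluder

u_a = occlude(occluder, back_a)
u_b = occlude(occluder, back_b)
same_under_occlusion = u_a == u_b
same_under_superposition = superpose(occluder, back_a) == superpose(occluder, back_b)
```

Because both backgrounds generate the same occluded input, no function of that input alone can recover which one was present; this is the sense in which the mixing is not invertible.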

(b) Generative models and representational learning

This section introduces the basic framework within which one can understand learning and inference. This framework rests upon generative and recognition models, which are simply functions that map causes to sensory input and vice versa. Generative models afford a generic formulation of representational learning and inference in a supervised or self-supervised context. There are many forms of generative models that range from conventional statistical models (e.g. factor and cluster analysis) to those motivated by Bayesian inference and learning (e.g. Dayan et al. 1995; Hinton et al. 1995). The goal of generative models is ‘to learn representations that are economical to describe but allow the input to be reconstructed accurately’ (Hinton et al. 1995). The distinction between reconstructing inputs and learning efficient representations relates directly to the distinction between inference and learning.

(i) Inference versus learning

Generative models relate unknown causes v and unknown parameters θ, to observed inputs u. The objective is to make inferences about the causes and learn the parameters. Inference may be simply estimating the most likely cause, and is based on estimates of the parameters from learning. A generative model is specified in terms of a prior distribution over the causes p(v;θ) and the generative distribution or likelihood of the inputs given the causes p(u|v;θ). Together, these define the marginal distribution of inputs implied by a generative model

p (u; θ) = \int p (u | v; θ) p (v; θ) d v .

(3.2)

The conditional density of the causes, given the inputs, is given by the recognition model, which is defined in terms of the recognition distribution

p (v | u; θ) = \frac{p (u | v; θ) p (v; θ)}{p (u; θ)} .

(3.3)

However, as considered above, the generative model may not be inverted easily and it may not be possible to parameterize this recognition distribution. This is crucial because the endpoint of learning is the acquisition of a useful recognition model that can be applied to sensory inputs by the brain. One solution is to posit an approximate recognition or conditional density q(v;u) that is consistent with the generative model and that can be parameterized. Estimating the moments (e.g. expectation) of this density corresponds to inference. Estimating the parameters of the underlying generative model corresponds to learning. This distinction maps directly onto the two steps of EM.
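Equations (3.2) and (3.3) can be made concrete with a minimal sketch in Python. The binary cause, the binary input and all probabilities below are hypothetical values chosen for illustration; they are not taken from the text. With discrete causes, the integral in (3.2) becomes a sum.

```python
# Hypothetical toy model: a binary cause v and a binary input u.
# All numbers are illustrative assumptions.
prior = {0: 0.8, 1: 0.2}              # p(v; theta)
likelihood = {0: {0: 0.9, 1: 0.1},    # p(u | v=0; theta)
              1: {0: 0.3, 1: 0.7}}    # p(u | v=1; theta)


def marginal(u):
    """Equation (3.2): marginal p(u; theta), summing over the causes."""
    return sum(likelihood[v][u] * prior[v] for v in prior)


def recognition(u):
    """Equation (3.3): recognition distribution p(v | u; theta)."""
    p_u = marginal(u)
    return {v: likelihood[v][u] * prior[v] / p_u for v in prior}


if __name__ == "__main__":
    print(marginal(1))
    print(recognition(1))
```

In this toy case the input u=1 is better explained by the cause v=1, so the recognition distribution favours v=1 despite its lower prior probability. With a handful of discrete causes the inversion is trivial; the difficulty discussed in the text arises when the sum or integral over causes cannot be evaluated in this way.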

(c) Expectation maximization

Here, we introduce a general scheme for inference and learning using EM (Dempster et al. 1977). To keep things simple, we will assume that we are only interested in the first moment or expectation of q(v;u), which we will denote by ϕ. This is the conditional mean or expected cause. EM is a coordinate ascent scheme that comprises an E-step and an M-step. In the present context, the E-step entails finding the conditional expectation of the causes (i.e. inference), while the M-step identifies the maximum likelihood value of the parameters (i.e. learning). Critically, both adjust the conditional causes and parameters to maximize the same objective function.

(i) The free energy formulation

EM provides a useful procedure for density estimation that has direct connections with statistical mechanics. Both steps of the EM algorithm involve maximizing a function of the densities above that corresponds to the negative free energy in physics.

\begin{array}{l} F = {〈 L 〉}_{u} \\ L = ln p (u; θ) - \mathrm{KL} \{ q (v; u), p (v | u; θ) \} . \end{array}

(3.4)

This objective function has two terms. The first is the log likelihood of the inputs under the generative model. The second term is the Kullback–Leibler divergence¹ between the approximate and true recognition densities. Critically, the second term is never negative, rendering F a lower bound on the expected log likelihood of the inputs. This means maximizing the objective function (i.e. minimizing the free energy) is simply minimizing our surprise about the data. The E-step increases F with respect to the expected cause, ensuring a good approximation to the recognition distribution implied by the parameters θ. This is inference. The M-step changes θ, enabling the generative model to match the input density and corresponds to learning.

\begin{array}{l} Inference (E) ϕ = max_{ϕ} F \\ Learning (M) θ = max_{θ} F \end{array}

(3.5)
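The bound implied by equation (3.4) can be checked numerically. The following is a minimal sketch for a single input and a binary cause; the probabilities and the candidate densities q are hypothetical. It evaluates L for several choices of q and compares each value with the log likelihood ln p(u;θ).

```python
import math

# Hypothetical toy model for one fixed input u (illustrative numbers only).
prior = [0.8, 0.2]        # p(v; theta) for v = 0, 1
likelihood = [0.1, 0.7]   # p(u | v; theta) for v = 0, 1


def log_evidence():
    """ln p(u; theta), the first term of L in equation (3.4)."""
    return math.log(sum(l * p for l, p in zip(likelihood, prior)))


def posterior():
    """True recognition density p(v | u; theta)."""
    z = sum(l * p for l, p in zip(likelihood, prior))
    return [l * p / z for l, p in zip(likelihood, prior)]


def kl(q, p):
    """Kullback-Leibler divergence between two discrete densities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def objective(q):
    """L in equation (3.4) for an approximate recognition density q."""
    return log_evidence() - kl(q, posterior())


# Candidate recognition densities, ending with the true posterior.
candidates = [[0.9, 0.1], [0.5, 0.5], posterior()]
values = [objective(q) for q in candidates]

if __name__ == "__main__":
    for q, value in zip(candidates, values):
        print(q, value, "bound:", log_evidence())
```

Every candidate gives a value at or below ln p(u;θ), and the bound is attained only when q equals the true recognition density. In this sketch the maximization is over q itself rather than over its expectation ϕ, which is a simplification relative to the scheme in the text.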

EM enables exact and approximate maximum likelihood density estimation for a whole variety of generative models that can be specified in terms of prior and generative distributions. Dayan & Abbott (2001) work through a series of didactic examples from cluster analysis to independent component analyses, within this unifying framework. From a neurobiological perspective, the remarkable thing about this formalism is that both inference and learning are driven in exactly the same way, namely to minimize the free energy. This is effectively the same as minimizing surprise about sensory inputs encountered. As we will see below, the implication is that the same simple principle can explain phenomena ranging from the MMN in evoked electrical brain responses to Hebbian plasticity during perceptual learning.
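As an illustration of the two steps in the simplest clustering setting, the sketch below runs EM on a one-dimensional mixture of two Gaussians. Everything specific here is an assumption made for the example: the synthetic data, the fixed unit variances, the starting values and the function names. The E-step computes the full conditional density over the discrete cause rather than only its expectation.

```python
import math
import random


def normal_pdf(x, mean, var=1.0):
    """Gaussian density; the variance is held fixed at 1 in this sketch."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_mixture(data, means, weights, n_iter=30):
    """EM for a Gaussian mixture with known unit variances."""
    means, weights = list(means), list(weights)
    history = []  # log likelihood of the data before each update
    for _ in range(n_iter):
        # E-step (inference): conditional probability of each cause per input.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [w * normal_pdf(x, m) for w, m in zip(weights, means)]
            p_x = sum(joint)
            log_lik += math.log(p_x)
            resp.append([j / p_x for j in joint])
        history.append(log_lik)
        # M-step (learning): re-estimate the means and mixing proportions.
        for k in range(len(means)):
            r_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / r_k
            weights[k] = r_k / len(data)
    return means, weights, history


# Synthetic inputs generated by two hypothetical causes.
rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
means, weights, history = em_mixture(data, means=[-1.0, 1.0], weights=[0.5, 0.5])

if __name__ == "__main__":
    print(means, weights, history[0], history[-1])
```

Because both steps increase the same objective, the recorded log likelihood should not decrease from one iteration to the next, which is the property the text emphasizes.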

(d) Predictive coding

In §3(c), we established an objective function that is maximized to enable inference and learning in E- and M-steps, respectively. In this section, we consider how that maximization might be implemented. In particular, we will look at predictive coding, which is based on minimizing prediction error. Prediction error is the difference between the input observed and that predicted by the generative model and inferred causes. We will see that minimizing the free energy is equivalent to minimizing prediction error. Consider any static nonlinear generative model under Gaussian assumptions

\begin{array}{l} u = g (v, θ) + ϵ_{u} \\ v = v_{p} + ϵ_{p}, \end{array}

(3.6)

where Cov{ϵ_u}=Σ_u is the covariance of any random or stochastic part of the generative process. Priors on the causes are specified in terms of their expectation v_p and covariance Cov{ϵ_p}=Σ_p. This form will be useful in the next section when we generalize to hierarchical models. For simplicity, we will approximate the recognition density with a point mass. From equation (3.4),

\begin{array}{l} L = - \frac{1}{2} ξ_{u}^{T} ξ_{u} - \frac{1}{2} ξ_{p}^{T} ξ_{p} - \frac{1}{2} ln | Σ_{u} | - \frac{1}{2} ln | Σ_{p} | \\ ξ_{u} = Σ_{u}^{- 1 / 2} (u - g (ϕ, θ)) \\ ξ_{p} = Σ_{p}^{- 1 / 2} (ϕ - v_{p}) . \end{array}

(3.7)

The first term in equation (3.7) is the prediction error that is minimized in predictive coding. The second corresponds to a prior term that constrains or regularizes conditional estimates of the causes. The need for this term stems from the ambiguous or ill-posed nature of recognition discussed above and is a ubiquitous component of inverse solutions.
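The role of the prior term can be illustrated with a minimal sketch of inference under equation (3.7) for a scalar cause. The generative function g(v,θ)=θv², the variances, the learning rate and the input value are all assumptions chosen for the example. This g is not invertible, because v and −v generate the same input, so the prediction error alone cannot identify the cause.

```python
import math


def g(v, theta=1.0):
    """Hypothetical generative function; v and -v give the same input."""
    return theta * v * v


def dg_dv(v, theta=1.0):
    """Derivative of g with respect to the cause."""
    return 2.0 * theta * v


def objective(phi, u, v_p, var_u=1.0, var_p=4.0):
    """L in equation (3.7) for scalar quantities."""
    xi_u = (u - g(phi)) / math.sqrt(var_u)   # weighted prediction error
    xi_p = (phi - v_p) / math.sqrt(var_p)    # weighted deviation from the prior
    return (-0.5 * xi_u ** 2 - 0.5 * xi_p ** 2
            - 0.5 * math.log(var_u) - 0.5 * math.log(var_p))


def gradient(phi, u, v_p, var_u=1.0, var_p=4.0):
    """dL/dphi: prediction-error term plus prior term."""
    return (u - g(phi)) * dg_dv(phi) / var_u - (phi - v_p) / var_p


def infer(u, v_p, rate=0.01, n_steps=2000):
    """Gradient ascent on L with respect to phi, starting at the prior mean."""
    phi = v_p
    for _ in range(n_steps):
        phi += rate * gradient(phi, u, v_p)
    return phi


# The same input under two different prior expectations.
phi_pos = infer(u=4.0, v_p=1.0)
phi_neg = infer(u=4.0, v_p=-1.0)

if __name__ == "__main__":
    print(phi_pos, phi_neg)
```

The same input yields a positive estimate under a positive prior expectation and a negative estimate under a negative one. Each estimate sits slightly closer to its prior expectation than the value that would abolish prediction error, which is the regularizing effect described above.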

Predictive coding schemes can be regarded as arising from the distinction between forward and inverse models adopted in machine vision (Ballard et al. 1983; Kawato et al. 1993). Forward models generate inputs from causes (cf. generative models), whereas inverse models approximate the reverse transformation of inputs to causes (cf. recognition models). This distinction embraces the noninvertibility of generating processes and the ill-posed nature of inverse problems. As with all underdetermined inverse problems, the role of constraints is central. In the inverse literature, a priori constraints usually enter in terms of regularized solutions. For example, ‘Descriptions of physical properties of visible surfaces, such as their distance and the presence of edges, must be recovered from the primary image data. Computational vision aims to understand how such descriptions can be obtained from inherently ambiguous and noisy data. A recent development in this field sees early vision as a set of ill-posed problems, which can be solved by the use of regularization methods’ (Poggio et al. 1985). The architectures that emerge from these schemes suggest that ‘Feedforward connections from the lower visual cortical area to the higher visual cortical area provides an approximated inverse model of the imaging process (optics)’. Conversely, ‘…the backprojection connection from the higher area to the lower area provides a forward model of the optics’ (Kawato et al. 1993; see also Harth et al. 1987). This perspective highlights the importance of backward connections and the role of priors in enabling predictive coding schemes.

(i) Predictive coding and Bayes

Predictive coding is a strategy that has some compelling (Bayesian) underpinnings. To finesse the inverse problem posed by noninvertible generative models, constraints or priors are required. These resolve the ill-posed problems that confound recognition based on purely forward architectures. It has long been assumed that sensory units adapt to the statistical properties of the signals to which they are exposed (see Simoncelli & Olshausen 2001 for a review). In fact, the Bayesian framework for perceptual inference has its origins in Helmholtz's notion of perception as unconscious inference. Helmholtz realized that retinal images are ambiguous and that prior knowledge was required to account for perception (Kersten et al. 2004). Kersten et al. (2004) provide an excellent review of object perception as Bayesian inference and ask a fundamental question, ‘Where do the priors come from. Without direct input, how does image-independent knowledge of the world get put into the visual system?’ In §3(e), we answer this question and show how empirical Bayes allows priors to be learned and induced online during inference.

(e) Cortical hierarchies and empirical Bayes

The problem with fully Bayesian inference is that the brain cannot construct the prior expectation and variability, v_p and Σ_p, de novo. They have to be learned and also adapted to the current experiential context. This is a solved problem in statistics and calls for empirical Bayes, in which priors are estimated from data. Empirical Bayes harnesses the hierarchical structure of a generative model, treating the estimates at one level as priors on the subordinate level (Efron & Morris 1973). This provides a natural framework within which to treat cortical hierarchies in the brain, each level providing constraints on the level below. This approach models the world as a hierarchy of systems where supraordinate causes induce and moderate changes in subordinate causes. These priors offer contextual guidance towards the most likely cause of the input. Note that predictions at higher levels are subject to the same constraints; only the highest level, if there is one in the brain, is unconstrained. If the brain has evolved to recapitulate the causal structure of its environment, in terms of its sensory infrastructures, it is possible that our visual cortices reflect the hierarchical causal structure of our environment.

Next, we introduce hierarchical models and extend the parameterization of the ensuing generative model to cover priors. This means that the constraints, required by predictive coding and regularized solutions to inverse problems, are now absorbed into the learning scheme and are estimated in exactly the same way as the parameters. These extra parameters encode the variability or precision of the causes and are referred to as hyperparameters in the classical covariance component literature. Hyperparameters are updated in the M-step and are treated in exactly the same way as the parameters.

(i) Hierarchical models

Consider any level i in a hierarchy whose causes v_i are elicited by causes in the level above v_{i+1}. The hierarchical form of the generative model is

\begin{array}{l} u = g_{1} (v_{2}, θ_{1}) + ϵ_{1} \\ v_{2} = g_{2} (v_{3}, θ_{2}) + ϵ_{2} \\ v_{3} = \dots, \end{array}

(3.8)

with u=v₁ (cf. equation (3.6)). Technically, these models fall into the class of conditionally independent hierarchical models when the stochastic terms are independent (Kass & Steffey 1989). These models are also called parametric empirical Bayes (PEB) models because the obvious interpretation of the higher-level densities as priors led to the development of PEB methodology (Efron & Morris 1973). Often, in statistics, these hierarchical models comprise just two levels, which is a useful way to specify simple shrinkage priors on the parameters of single-level models. We will assume the stochastic terms are Gaussian with covariance Σ_i=Σ(λ_i). Therefore, v_{i+1}, θ_i and λ_i parameterize the means and covariances of the likelihood at each level.

p (v_{i} | v_{i + 1}; θ) = N (g_{i} (v_{i + 1}, θ_{i}), Σ_{i}) .

(3.9)

This likelihood also plays the role of a prior on v_i at the level below, where it is jointly maximized with the likelihood p(v_{i−1}|v_i;θ). This is the key to understanding the utility of hierarchical models. By learning the parameters of the generative distribution of level i, one is implicitly learning the parameters of the prior distribution for level i−1. This enables the learning of prior densities.

The hierarchical nature of these models lends an important context-sensitivity to recognition densities not found in single-level models. The key point here is that high-level causes v_{i+1} determine the prior expectation of causes v_i in the subordinate level. This can completely change the distributions p(v_i|v_{i+1};θ), upon which inference is based, in an input and context-dependent way.
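The way a higher level supplies a learned prior can be illustrated with the simplest two-level Gaussian case. The sketch below assumes inputs u_j = v_j + noise with a known noise variance, and causes v_j drawn from a common Gaussian prior whose mean and variance are unknown. The prior is estimated from the inputs themselves by moment matching and then used to shrink each estimate. The data, the variances and the function name are hypothetical, and moment matching stands in for the M-step of the scheme described in the text.

```python
import random
import statistics


def empirical_bayes(observations, noise_var):
    """Estimate a Gaussian prior from the data, then shrink towards it."""
    # Learn the prior: mean and variance of the causes, by moment matching.
    prior_mean = statistics.mean(observations)
    total_var = statistics.pvariance(observations, mu=prior_mean)
    prior_var = max(total_var - noise_var, 0.0)
    # Infer the causes: conditional means under the estimated prior.
    shrink = prior_var / (prior_var + noise_var)
    estimates = [prior_mean + shrink * (u - prior_mean) for u in observations]
    return estimates, prior_mean, prior_var


# Synthetic two-level data (illustrative values only).
rng = random.Random(1)
causes = [rng.gauss(5.0, 1.0) for _ in range(200)]
inputs = [v + rng.gauss(0.0, 2.0) for v in causes]
estimates, prior_mean, prior_var = empirical_bayes(inputs, noise_var=4.0)


def sq_error(values):
    """Total squared error relative to the true causes."""
    return sum((x - v) ** 2 for x, v in zip(values, causes))


if __name__ == "__main__":
    print(prior_mean, prior_var)
    print(sq_error(inputs), sq_error(estimates))
```

The shrunken estimates should lie closer to the true causes than the raw inputs do, even though the prior was never specified in advance. This is the sense in which priors can be estimated from data.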

(ii) Implementation

The biological plausibility of empirical Bayes in the brain can be established fairly simply. To do this, a hierarchical scheme is described in some detail. A more thorough account, including simulations of various neurobiological and psychophysical phenomena, will appear in future publications. For the moment, we will address neuronal implementation at a purely theoretical level, using the framework above.

For simplicity, we will again assume deterministic recognition. In this setting, with conditional independence, the objective function is

\begin{array}{l} L = - \frac{1}{2} ξ_{1}^{T} ξ_{1} - \frac{1}{2} ξ_{2}^{T} ξ_{2} - \dots - \frac{1}{2} ln | Σ_{1} | - \frac{1}{2} ln | Σ_{2} | - \dots \\ ξ_{i} = ϕ_{i} - g_{i} (ϕ_{i + 1}, θ_{i}) - λ_{i} ξ_{i} = {(1 + λ_{i})}^{- 1} (ϕ_{i} - g_{i} (ϕ_{i + 1}, θ_{i})) \end{array}

(3.10)

(cf. equation (3.7)). Here, Σ_i^{1/2}=1+λ_i. In neuronal models, the prediction error is encoded by the activities of units denoted by ξ_i. These error units receive a prediction from units in the level above² via backward connections and lateral influences from the representational units ϕ_i being predicted. Horizontal interactions among the error units serve to decorrelate them (cf. Foldiak 1990), where the symmetric lateral connection strengths λ_i hyperparameterize the covariances of the errors Σ_i, which are the prior covariances for level i−1.
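The second line of equation (3.10) can be unpacked as follows. This is a sketch of the algebra only. The last step assumes scalar quantities; in the matrix case the derivative of the log determinant would involve a trace.

```latex
\begin{aligned}
ξ_i = ϕ_i - g_i(ϕ_{i+1}, θ_i) - λ_i ξ_i
  &\;\Rightarrow\; (1 + λ_i)\, ξ_i = ϕ_i - g_i(ϕ_{i+1}, θ_i) \\
  &\;\Rightarrow\; ξ_i = (1 + λ_i)^{-1} \bigl(ϕ_i - g_i(ϕ_{i+1}, θ_i)\bigr)
     = Σ_i^{-1/2} \bigl(ϕ_i - g_i(ϕ_{i+1}, θ_i)\bigr), \\
\ln |Σ_i| = 2 \ln |1 + λ_i|
  &\;\Rightarrow\; -\tfrac{1}{2} \frac{\partial \ln |Σ_i|}{\partial λ_i} = -(1 + λ_i)^{-1}
  \quad \text{(scalar case)}.
\end{aligned}
```

The first form expresses ξ_i as the raw prediction error minus lateral influences weighted by λ_i. The second shows that this is the prediction error scaled by Σ_i^{-1/2}, which matches the definitions of ξ_u and ξ_p in equation (3.7).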

The estimators ϕ_i and parameters perform a gradient ascent on the objective function

\begin{array}{l} E : {\dot{ϕ}}_{i + 1} = \frac{\partial F}{\partial ϕ_{i + 1}} = - \frac{\partial ξ_{i}^{T}}{\partial ϕ_{i + 1}} ξ_{i} - \frac{\partial ξ_{i + 1}^{T}}{\partial ϕ_{i + 1}} ξ_{i + 1} \\ M : {\dot{θ}}_{i} = \frac{\partial F}{\partial θ_{i}} = - {〈 \frac{\partial ξ_{i}^{T}}{\partial θ_{i}} ξ_{i} 〉}_{u} \\ \phantom{M :} {\dot{λ}}_{i} = \frac{\partial F}{\partial λ_{i}} = - {〈 \frac{\partial ξ_{i}^{T}}{\partial λ_{i}} ξ_{i} 〉}_{u} - {(1 + λ_{i})}^{- 1} . \end{array}

(3.11)

Inferences mediated by the E-step rest on changes in the representational units, mediated by forward connections from error units in the level below and lateral interaction with error units within the same level. Similarly, prediction error is constructed by comparing the activity of representational units, within the same level, to their predicted activity conveyed by backward connections.

This is the simplest version of a very general learning algorithm. It is general in the sense that it does not require the parameters of either the generative or the prior distributions. It can learn noninvertible, nonlinear generative models and encompasses complicated hierarchical processes. Furthermore, each of the learning components has a relatively simple neuronal interpretation (see below).
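The E- and M-steps of equation (3.11) can be sketched for a minimal two-level scalar model. All specifics are assumptions made for the example: a linear generative function g_1(v_2,θ_1)=θ_1 v_2, a fixed top-level expectation ETA in place of g_2, hyperparameters held at λ_i=0 (so the log determinant terms are constant and omitted), and synthetic data. Only θ_1 is learned.

```python
import random

ETA = 1.0  # hypothetical fixed prediction from the level above


def errors(u, phi, theta):
    """Prediction errors at the two levels (lambda_i = 0 in this sketch)."""
    xi_1 = u - theta * phi   # input versus its top-down prediction
    xi_2 = phi - ETA         # cause versus its prior expectation
    return xi_1, xi_2


def e_step(u, theta, rate=0.05, n_steps=60):
    """Inference: gradient ascent on the representational unit phi."""
    phi = ETA
    for _ in range(n_steps):
        xi_1, xi_2 = errors(u, phi, theta)
        # Error from the level below, weighted by theta, minus same-level error.
        phi += rate * (theta * xi_1 - xi_2)
    return phi


def learn(inputs, theta, rate=0.2, n_epochs=150):
    """Learning: gradient ascent on theta, averaged over all inputs."""
    history = []  # objective F recorded before each parameter update
    for _ in range(n_epochs):
        grad, total = 0.0, 0.0
        for u in inputs:
            phi = e_step(u, theta)
            xi_1, xi_2 = errors(u, phi, theta)
            grad += phi * xi_1   # -(d xi_1 / d theta) * xi_1
            total += -0.5 * xi_1 ** 2 - 0.5 * xi_2 ** 2
        history.append(total / len(inputs))
        theta += rate * grad / len(inputs)
    return theta, history


# Synthetic inputs from a hypothetical generating process with theta = 2.
rng = random.Random(2)
inputs = [2.0 * rng.gauss(1.0, 0.5) + rng.gauss(0.0, 0.5) for _ in range(100)]
theta_learned, history = learn(inputs, theta=0.5)

if __name__ == "__main__":
    print(theta_learned, history[0], history[-1])
```

Note that the parameter update has a Hebbian form, the product of the activity ϕ and the error ξ_1 it predicts. The learned θ_1 is not expected to match the generating value exactly, because a point-mass recognition density ignores conditional uncertainty about the causes; the sketch only shows that both steps increase the same objective.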

4. Implications for cortical infrastructure and plasticity

(a) Cortical connectivity

The scheme implied by equation (3.11) has four clear implications or predictions about the functional architectures required for its implementation. We now review these in relation to cortical organization in the brain. A schematic summarizing these points is provided in figure 2. In short, we arrive at exactly the same four points presented at the end of §2(c).

Figure 2

Upper panel: schematic depicting a hierarchical predictive coding architecture. Here, hierarchical arrangements within the model serve to provide predictions or priors to representations in the level below. The upper circles represent error units and the lower circles functional subpopulations encoding the conditional expectation of causes. These expectations change to minimize both the discrepancy between their predicted value and the mismatch incurred by their prediction of the level below. These two constraints correspond to prior and likelihood terms, respectively (see main text). Lower panel: a more detailed depiction of the influences on representational and error units.