
Three Problems for the Predictive Coding Theory of Attention

Madeleine Ransom (University of British Columbia)

Sina Fazelpour (University of British Columbia)


 

Abstract

While philosophers of science and epistemologists are well acquainted with Bayesian methods of belief updating, there is a new Bayesian revolution sweeping neuroscience and perceptual psychology. First proposed by Helmholtz, predictive coding is the view that the human brain is fundamentally a hypothesis generator. Though predictive coding has most prominently offered a theory of perception – the bulk of the empirical support for the theory also lies in this domain – the Bayesian framework also promises to deliver a comprehensive theory of attention that falls out of the perceptual theory without the need for positing additional machinery.

The predictive coding (PC) theory of attention proposed by Feldman & Friston (2010) and defended by Hohwy (2012, 2013) is that attention is “the process of optimizing precision of prediction errors in hierarchical perceptual inference” (Hohwy 2013, p.195). Prediction errors are measurements of the difference, or mismatch, between predicted and actual evidence. Expected precisions are a measure of how reliable, or precise, we expect the prediction error signal to be in a given context: how likely is it in a given situation that the incongruent data constitutes legitimate prediction error as opposed to noise? On this picture, attention has the functional role of guiding perceptual inference by directing processing resources towards the prediction errors with the higher expected precisions.

We argue here that this theory of attention faces significant challenges on three counts. First, while the theory may provide a successful account of endogenous spatial attention, it fails to model endogenous feature-based attention: for attention to be driven by expectations of precision, then it has to be driven to an area where a large prediction error is generated. However, this is the inverse of what is needed to drive attention towards the relevant object. We further consider whether Clark’s (2013) proposed ‘provisional detection’ solution to a similar problem raised by (Bowman, Filetti, Wyble, & Olivers, 2013) can be understood along the lines of ‘gist perception’ (Bar, 2003), and whether this resolves the issue.

Second, it is unclear how the theory may accommodate non-perceptual forms of attention such as attention to one’s thoughts. The PC theory of attention is committed to the claim that attention just is the amplification of gain on prediction error, and that this is driven by expected precision. So the proposal would be that we pay attention to our thoughts when we expect them to be precise. However, this proposal remains to be filled out. Do we expect our thoughts to be more precise on some occasions rather than others? If so, what learned causal regularity underlies this expectation?

Third, it fails to accommodate the influence of affectively salient objects or high cost situations in guiding and capturing attention. This points to a more general need to integrate both agent-level preferences and the cost of false negatives and false positives into the model, such that standards for expected precision can be adjusted. The challenge for the PC theory of attention is then to accommodate these additional influences on attention in terms of expected precisions.

 

1  Predictive coding in neuroscience

While philosophers of science and epistemologists are well acquainted with Bayesian methods of belief updating, there is a new Bayesian revolution sweeping neuroscience and perceptual psychology. First proposed by Helmholtz (2005), and with formal roots in signal processing data compression strategies (Shi & Sun, 1999) and pattern recognition in machine learning (Bishop, 2006), predictive coding is the view that the human brain is fundamentally a hypothesis generator.

On this view, the processes by which the brain tests its self-generated hypotheses against sensory evidence are seen as conforming to a hierarchical Bayesian operation; each level of the hierarchy involves a hypothesis space, with higher levels generating hypotheses about more complex and slower regularities as compared to the lower levels. The higher-level hypothesis spaces serve to generate and constrain the lower-level hypothesis spaces, thus enabling the lower levels to predict the evidence. When there is a mismatch between the predicted and actual evidence, a prediction error is produced and is relayed up the hierarchy, where it is used to revise the hypothesis. Through the iterative interaction between top-down signals (which encode predictions) and bottom-up signals (which encode prediction error), the generative models that can predict the evidence most accurately are selected.
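The hierarchical scheme just described can be illustrated with a toy numerical sketch. This is our own minimal construction, not any published model: a single level estimating one hidden cause, where the revision of the hypothesis is driven by precision-weighted prediction error from below and a constraint from the level above. The function name, parameters, and all numbers are illustrative assumptions.

```python
# Toy illustration (our construction, not a published model): a single level
# of a predictive coding hierarchy estimating a hidden cause mu from noisy
# evidence. The prediction error is weighted by its expected precision
# (inverse variance) before it is allowed to revise the hypothesis.

def update_hypothesis(mu, evidence, precision, prior_mu, prior_precision, lr=0.1):
    """One gradient step on precision-weighted prediction errors."""
    sensory_error = evidence - mu    # bottom-up prediction error
    prior_error = prior_mu - mu      # top-down constraint from the level above
    # Revise the hypothesis in proportion to each error's expected precision.
    return mu + lr * (precision * sensory_error + prior_precision * prior_error)

mu = 0.0
for _ in range(200):
    mu = update_hypothesis(mu, evidence=2.0, precision=1.0,
                           prior_mu=0.0, prior_precision=0.25)
# mu settles between the prior (0.0) and the evidence (2.0),
# closer to whichever source is expected to be more precise
```

The fixed point here is the precision-weighted average of prior and evidence, which is the sense in which "hypothesis testing" and Bayesian updating coincide on this picture.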

Given the crucial role of sensory evidence in supervising the hypothesis testing process, it is no surprise that the view has garnered the most significant empirical support as a theory of perception (Hohwy, Roepstorff, & Friston, 2008; Huang & Rao, 2011; Stefanics, Kremlacek, & Czigler, 2014). Nonetheless, increasing numbers of neuroscientists are also adopting the predictive coding framework in some capacity in order to elucidate attention, action (Berniker & Kording, 2011; Friston, Daunizeau, Kilner, & Kiebel, 2010; Körding & Wolpert, 2006), dreaming (Hobson & Friston, 2012), schizophrenia (Adams, Perrinet, & Friston, 2012; Horga, Schatz, Abi-Dargham, & Peterson, 2014; Wilkinson, 2014), interoception and the emotions (Seth & Critchley, 2013; Seth, Suzuki, & Critchley, 2011).

 

2  The predictive coding theory of attention

The predictive coding (PC) theory of attention proposed by Feldman & Friston (2010) and defended by Hohwy (2012, 2013) is that attention is “the process of optimizing precision of prediction errors in hierarchical perceptual inference” (Hohwy 2013, p.195).[1] Prediction errors are measurements of the difference, or mismatch, between predicted and actual evidence. Expected precisions are a measure of how reliable, or precise, we expect the prediction error signal to be in a given context: how likely is it in a given situation that the incongruent data constitutes legitimate prediction error as opposed to noise?

Optimizing expected precisions is the process of guiding hypothesis revision by directing processing resources towards the prediction errors with the higher expected precisions – we attend to what is expected to be the most informative, and this information is used to preferentially revise our perceptual hypotheses. Such a practice allows us to avoid the potentially disastrous consequences of revising our hypotheses on the basis of noise-induced prediction error.

On this picture, attention has the functional role of guiding perceptual inference by directing processing resources towards the prediction errors with the higher expected precisions. Again, this results in the minimization of prediction error, though attention is concerned only with the expected precision of prediction error and not directly with the accuracy of the hypotheses. However, because the estimation of expected precisions is a fundamental aspect of perceptual inference, so too is attention. While the account is meant to be a comprehensive theory that encompasses both endogenous and exogenous attention, this paper will focus primarily on the former.

In exogenous (bottom-up) attention, the presentation of a contextually salient stimulus results in an abrupt and large prediction error. This large prediction error will draw one’s attention to the unattended stimulus because of a learned causal regularity that stronger signals are more precise (Hohwy 2013). Since on the PC theory signal is defined as prediction error, a strong signal will be one that carries a large prediction error, and large prediction errors count as stronger signals because they are expected to be more informative. Because larger signals are expected to be more precise, the gain or amplitude of this large prediction error will be enhanced (which just amounts to paying attention to the stimulus). Attention will then cause the hypothesis to be revised preferentially in light of this prediction error.

Endogenous (top-down) attention can be understood in terms of a conscious decision to attend to a given object or spatial region, or it can be understood more minimally in terms of endogenous cueing that requires agent interpretation. Using the classic Posner paradigm as an illustrative device, the PC theory of attention provides the following account of endogenous spatial attention. First, through repeated trials the subject learns that when an arrow is shown pointing to a given area on a computer screen, an object will likely appear in that area. This learned causal regularity is a contextually mediated expectation for precision: when there is an arrow pointing towards a given location, the prediction error that will subsequently be produced by the appearance of the object in that location is expected to be precise, or reliable. Second, suppose an arrow appears on the screen, pointing to the bottom right corner. This causes two things to happen: (i) the prior probability of there being an object in the bottom right corner goes up; and (ii) the gain from the prediction error issuing from this region is increased (this is tantamount to saying that one pays attention to the bottom right corner, as gain is identified with attention). Third, when the stimulus appears it is perceived more rapidly, for two reasons: the gain on the prediction error for this spatial region is enhanced, making it such that this prediction error drives the revision of the hypothesis; and the higher prior probability accorded to the hypothesis that an object will appear in this corner makes it the case that this hypothesis is more likely to be amongst those selected to drive perceptual inference in the first place. It allows the perceptual inference process to begin with a more accurate hypothesis, and so spend less time in revisions.
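The two effects of the cue in the Posner account above can be sketched numerically. The following is a hypothetical toy of our own devising, not Feldman & Friston's actual simulation: a valid cue (i) raises the prior on "object present at this location" and (ii) increases the gain on prediction error from that location, and together these reduce the number of revision steps needed to detect the stimulus. All parameter values are illustrative assumptions.

```python
# Hypothetical toy (our construction): why a validly cued stimulus is
# "perceived more rapidly" on the PC story. The cue raises the prior on
# the object's presence AND the gain on prediction error from its location.

def steps_to_detect(prior, gain, evidence=1.0, threshold=0.9, lr=0.2):
    """Iterations until the 'object present' estimate crosses threshold."""
    belief = prior
    steps = 0
    while belief < threshold:
        error = evidence - belief      # prediction error at the location
        belief += lr * gain * error    # gain-weighted hypothesis revision
        steps += 1
    return steps

uncued = steps_to_detect(prior=0.1, gain=1.0)  # no cue: low prior, baseline gain
cued = steps_to_detect(prior=0.4, gain=1.5)    # valid cue: higher prior and gain
# cued < uncued: fewer revision cycles before detection
```

The higher prior means inference starts nearer the correct hypothesis, and the higher gain means each prediction error drives a larger revision; either alone would speed detection, and the account invokes both.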

 

3  A problem for the PC theory of endogenous attention

Note that the arrow cue does not predict which object will appear (unless a hypothesis has been formed for this as well via conditioning, such as that dots are likely to appear after arrow cues). However, Hohwy claims that the same sort of explanation can be applied to feature-based endogenous attention (2013, p.196). In such cases, an object ‘pops out’ of a scene when one has been given the task of looking for it. How might this work? It is crucial that it do so, as many cases of endogenous attention are those involving searches for certain features or objects over others. However, it is unclear how the account is supposed to go, given that attention must be driven by high expected precision.

To illustrate the problem that arises for the PC account, take the case of searching for one’s keys. What are the relevant precision expectations driving attention? They cannot be spatial – one doesn’t have high expected precision for any particular spatial region (beyond a few general expectations, such as that one’s lost keys typically won’t be found hanging from the ceiling). Perhaps one begins with a high expected precision for any prediction error generated by the hypothesis ‘this item is my keyset’. Certain features – silver, key-shaped, jangling if moved – will drive lower level hypotheses, meaning that any prediction error relative to these hypotheses will be accorded high expected precision. But this can’t be the proposal, because then the agent would pay attention to all items that aren’t her keys – such items would generate the largest (and hence most precise) prediction errors. It looks like the inverse is needed – the agent must pay attention to the object that generates the least prediction error with respect to the hypothesis ‘this item is my keyset’.

However, this causes the following problem for the PC theory of attention. Recall that precision expectations are expectations of reliable signals. Reliable signals are those that have a high signal to noise ratio. On the PC theory, signal is just prediction error. So reliable signals are those that generate large prediction errors. For attention to be driven by expectations of precision, then it has to be driven to an area where a large prediction error is generated. However, this is the inverse of what is needed on the presupposition that the relevant hypothesis is ‘this item is my keyset’. Instead, in this case attention is driven to the spatial region that has generated the least prediction error, and so is most accurate – it is driven to the place where one’s keys are located.

Perhaps then the relevant hypothesis ought to be instead ‘it is not the case that this item is my keyset’. This gives us what we need – the largest prediction error will be generated when it does indeed turn out that the item is one’s keyset. One problem with this solution is that it seems ad hoc, vulnerable to the criticism that just about any Bayesian hypothesis can be cooked up to fit the data. Given that negation is a relatively sophisticated concept, it is rather implausible that it forms part of the content of our perceptual inference whenever we engage in endogenous cueing tasks. Not only is it a relatively more linguistically sophisticated concept for children to acquire, but there is also the issue of how negation is implemented in the predictive coding framework. How might negation be represented in the generative model? What effect does it have on our hypotheses? Is there a vast increase in their complexity, insofar as there are an infinite number of objects that fail to be my keyset and so satisfy my prediction?

Bowman, Filetti, Wyble, and Olivers (2013) raise a related worry for the predictive coding account of endogenous attention:

“What makes attention so adaptive is that it can guide towards an object at an unpredictable location – simply on the basis of features. For example, we could ask the reader to find the nearest word printed in bold. Attention will typically shift to one of the headers, and indeed momentarily increase precision there, improving reading. But this makes precision weighting a consequence of attending. At least as interesting is the mechanism enabling stimulus selection in the first place. The brain has to first deploy attention before a precision advantage can be realized for that deployment” (207, emphasis original).

Clark (2013) responds to Bowman et al.’s worry as follows:

“The resolution of this puzzle lies, I suggest, in the potential assignment of precision-weighting at many different levels of the processing hierarchy. Feature-based attention corresponds, intuitively, to increasing the gain on the prediction error units associated with the identity or configuration of a stimulus (e.g. increasing the gain on units responding to the distinctive geometric pattern of a four-leaf clover). Boosting that response (by giving added weight to the relevant kind of sensory prediction error) should enhance detection of that featural cue. Once the cue is provisionally detected, the subject can fixate the right spatial region, now under conditions of “four-leaf-clover-there” expectation. Residual error is then amplified for that feature at that location, and high confidence in the presence of the four-leaf clover can (if you are lucky!) be obtained” (p.238).

While this answers Bowman et al.’s worry insofar as one accepts that ‘provisional detection’ can guide spatial attention, it fails to address the original problem raised above because it fails to provide a satisfactory account of how provisional detection is accomplished in the predictive coding framework. To see this, it’s instructive to run through the example of the four-leaf clover more thoroughly. First, the system generates a hypothesis such as: ‘that’s a four-leafed clover’ or ‘that object has four heart-shaped green shapes arranged in a circle.’ The gain on the prediction error units for this hypothesis will be increased. This means that any sensory input that isn’t predicted at any level of the hypothesis generates a large prediction error that will be deemed reliable, and so will be able to preferentially revise the hypothesis. According to Clark, this upping of the gain is enough to enable ‘provisional detection’. But this just leads back to the problem that things that aren’t clovers are going to generate larger prediction errors than things that are, and since the gain has been turned up on any prediction error associated with the hypothesis, such objects will capture our attention preferentially insofar as the system is searching for a clover.

 

4  The gist perception solution

At the root of the problem is that the clover hypothesis needs to be selectively applied to the scene (to the space where the clover is located!), but this is exactly what is unknown prior to searching. Lack of spatial expectations for the clover makes it the case that the hypothesis cannot be applied selectively to the scene. However, perhaps provisional detection can be understood along the lines of ‘gist perception’. Bar (2003) holds that perception occurs given the system’s ability to first generate a prediction of the ‘gist’ of the scene or object using low spatial frequency visual information that results in a basic-level categorization of the object’s identity (see also Bar et al., 2001; Barrett & Bar, 2009; Oliva & Torralba, 2001; Schyns & Oliva, 1994; Torralba & Oliva, 2003). This then allows for the more fine-grained details to be filled in using the basic-level categorization as a guide. The idea here is that such basic-level categorization could guide selective application of the clover hypothesis, ensuring that it be applied only to objects that have the coarse-grained features of four-leaf clovers. This would then guide attention to the relevant spatial locations, privileging perceptual processing of these areas.

Of course, such a proposal is only a solution if the basic level categorization itself is the result of predictive coding, and here it is unclear as to whether the ‘gist’ is constructed using the hierarchical framework in a purely top-down manner. It certainly does not rely on high-level hypotheses such as ‘clover’. Constructing the gist of a scene or object would rather be reliant on lower level properties such as shape and color. It is then a further question whether such properties are detected in a feedforward model inconsistent with predictive coding, or predicted in a feedback model consistent with predictive coding (but with lower ‘high level’ hypotheses generating the content of the gist perception). Finally, even if gist perception is amenable to a predictive coding interpretation, there is the further question of exactly how attentional gain fits into the picture here. How, for example, does it prioritize the low spatial frequency information consistent with clover configurations? If the above problem reoccurs for the predictive coding account of gist perception, then it fails as a comprehensive theory of attention.

Moreover, even supposing that a convincing PC model of feature-based attention can be crafted, the account still fails to explain other key aspects of attention, such as emotional salience and non-perceptual attention.

 

5  Non-perceptual attention

A complete theory of attention must be able to account for what at least appear to be non-perceptual elements. We can pay attention to our thoughts (sometimes called intellectual attention), ruminations, mind-wanderings, memories and imaginings, and this can occur together with perception – one needn’t close one’s eyes in order to pay attention to one’s thoughts. How can these attentional shifts be accommodated under the present proposal? The picture is rendered even more complicated insofar as sometimes these shifts are exogenous and sometimes they are endogenous – sometimes we decide to pay attention and sometimes our attention is drawn involuntarily inwards. The PC theory of attention is committed to the claim that attention just is the amplification of gain on prediction error, and that this is driven by expected precision. So the proposal would be that we pay attention to our thoughts, imaginings and memories when we expect them to be precise. That is, when they generate a strong signal (large prediction error). What does this amount to? Do we expect our thoughts to be more precise on some occasions rather than others? If so, what learned causal regularity underlies this expectation?

One preliminary suggestion for regulating attention towards our own thoughts might be that it is accomplished via what we term here ‘global gain’. With global gain, learned expected precisions work to either dampen or heighten the bottom up prediction error signal tout court, and as a consequence the gain on the top down hypothesis is heightened or dampened as well in a uniform manner. For example, in extremely poor lighting contexts there is likely to be a significant amount of noise picked up by the visual modality, and so any higher-level hypothesis should give little weight to the prediction errors generated in virtue of the visual modality in this context. Expected precision is the means by which the prediction error generated by the visual modality is dampened down – in poor lighting contexts bottom-up prediction error is less heavily weighted in virtue of low expected precision from this signal. In such cases, top-down hypotheses are given a higher weight in driving perception, and are unlikely to be significantly revised in light of extremely noisy prediction error. In the extreme version of this case, top down hypotheses aren’t modified at all by prediction error. Hohwy (2013) takes this to be an explanation of visual and auditory hallucinations in the face of extremely impoverished sensory input.
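The 'global gain' idea just described has a familiar Bayesian shape, which can be sketched as follows. This is our own illustrative construction, not Hohwy's formal model: with Gaussian prior and likelihood, the posterior mean is a precision-weighted average, so when sensory precision is set low (poor lighting), the top-down prior dominates the percept. All numbers are arbitrary assumptions.

```python
# Illustrative sketch (our construction): precision-weighted cue combination
# behind the 'global gain' suggestion. With Gaussian prior and likelihood,
# the posterior mean is a precision-weighted average of the two estimates.

def posterior_mean(prior_mu, prior_prec, sensory_mu, sensory_prec):
    """Bayes-optimal fusion of a prior and a sensory estimate (Gaussians)."""
    total_prec = prior_prec + sensory_prec
    return (prior_prec * prior_mu + sensory_prec * sensory_mu) / total_prec

# Good lighting: a precise sensory signal pulls the percept to the evidence.
daylight = posterior_mean(prior_mu=0.0, prior_prec=1.0,
                          sensory_mu=5.0, sensory_prec=10.0)
# Poor lighting: the noisy signal is down-weighted tout court;
# the top-down prior dominates, and revision by prediction error is minimal.
dusk = posterior_mean(prior_mu=0.0, prior_prec=1.0,
                      sensory_mu=5.0, sensory_prec=0.1)
```

In the extreme low-precision limit the posterior simply reproduces the prior, which is the structure Hohwy (2013) appeals to in explaining hallucination under impoverished input.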

The suggestion would then be that we turn our attention inwards when bottom up prediction error is dampened down. Unfortunately, this suggestion is problematic because on the PC account attention is identified only with the postsynaptic gain on bottom-up prediction error units, not with gain on top-down hypotheses. Though such gain may occur, it has no attentional effects. So it cannot be the explanation for internal attention to thought.

A more promising potential avenue for exploration here might be in terms of epistemic emotions (or the emotional content of thoughts more generally, though we leave this out of the discussion for brevity’s sake). Epistemic emotions are emotions about the epistemic state of the agent. If one is in possession of conflicting evidence, for example p and ~p, then one may feel conflicted or confused. Such conflict may generate a large prediction error given an expectation that one generally does not hold conflicting beliefs. This large prediction error (perhaps felt at the agent level as the epistemic emotion of confusion) may be expected to be precise, given its size. This in turn may draw the agent’s attention inwards, towards her own thoughts. That is, such feelings of confusion may be felt before one is aware that one is in possession of such conflicting evidence. So epistemic emotions then serve to guide intellectual attention first by focusing attention inwards and second by sustaining and directing the subsequent searches – one searches for the source of the conflict when one feels confused. While such a suggestion holds promise, one might wonder whether epistemic emotions are really reducible to certain kinds of prediction errors. Moreover, there are substantive issues surrounding the integration of affective salience into the predictive coding model.

 

6  Emotional salience

There are many things that are important to our survival and wellbeing that are statistically not very likely to occur in a given context. Yet, they can (and ought to) capture our attention in these contexts. This represents a problem for the PC theory of attention, because it is committed to the view that expected precisions are learned statistical regularities, and so one should only pay attention to a given spatial region or object in a context where the signal is expected to be precise. Such cases of emotionally salient objects are counterexamples to the view because they drive attention while nevertheless being unlikely to occur.

For example, suppose you walk your dog every day past a house with an unfriendly Doberman. Though the Doberman is outside in the yard only one out of twenty times, when it is it rushes the fence and startles you. As a consequence, you always attend to the yard when you walk by in order not to be startled. But notice that you don’t expect a precise signal – you don’t expect the Doberman to be in the yard, because it seldom actually is there. It is rather the extreme unpleasantness of being startled that causes you to attend.

This raises two further potential problems for the PC model of attention. First, it points to a more general need to integrate agent-level preferences into the model – not only do we prefer to avoid being startled, but we also have many other preferences that guide attention. Such preferences direct attention in both a top down and bottom up manner – we may notice preferred or highly aversive objects without it being the case that such preferences are relevant to tasks that we are currently engaging in (Niu, Todd, & Anderson, 2012; Todd, Cunningham, Anderson, & Thompson, 2012). The challenge for the PC theory of attention is then to accommodate these additional influences on attention in terms of expected precisions.

Second, the initial example points to the need to factor in the cost of false negatives and false positives, such that standards for expected precision can be adjusted. In an evolutionary context, there is often a significantly higher cost to false negatives over false positives – when an animal’s survival is on the line, falsely interpreting noise as signal is the prudentially rational move (within certain boundaries, cf. Stephens, 2001). More colloquially, it’s better to be safe than sorry (or dead). On the PC model, attention is driven by signals that are expected to be precise (either because of a bottom up strong signal, or because of a top down expected precision). But attention can also be driven by the cost of getting it wrong – a noisy signal with potentially important information ought to be attended, even when it is not expected to be precise on the PC theory.

 

7  Conclusion

In conclusion, while the PC account of endogenous attention works well as an account of endogenous spatial attention, we have argued here that it fails to account for three central features of attention. First, it fails to model endogenous feature-based attention. Second, it does not accommodate non-perceptual forms of attention. Third, it fails to accommodate the influence of affectively salient objects or high cost situations in guiding and capturing attention. While predictive coding provides an attractive account of perception, it may fail to yield a theory of attention without supplementation that goes beyond the Bayesian framework.

 

References

Adams, R. A., Perrinet, L. U., & Friston, K. (2012). Smooth pursuit and visual occlusion: active inference and oculomotor control in schizophrenia. PloS One, 7(10), e47502.

Bar, M. (2003). A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience.

Bar, M., Tootell, R., Schacter, D. L., Greve, D. N., Fischl, B., Mendola, J. D., … Dale, A. M. (2001). Cortical Mechanisms Specific to Explicit Visual Object Recognition. Neuron, 29(2), 529–535.

Barrett, L., & Bar, M. (2009). See it with feeling: affective predictions during object perception. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1325–1334.

Berniker, M., & Kording, K. (2011). Bayesian approaches to sensory integration for motor control. Wiley Interdisciplinary Reviews: Cognitive Science, 2(4), 419–428.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Bowman, H., Filetti, M., Wyble, B., & Olivers, C. (2013). Attention is more than prediction precision. The Behavioral and Brain Sciences, 36(3), 206–8.

Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. The Behavioral and Brain Sciences, 36(3), 181–204.

Friston, K. J., Daunizeau, J., Kilner, J., & Kiebel, S. J. (2010). Action and behavior: a free-energy formulation. Biological Cybernetics, 102(3), 227–60.

Helmholtz, H. von. (2005). Treatise on physiological optics. Mineola: Dover.

Hobson, J. A., & Friston, K. J. (2012). Waking and dreaming consciousness: neurobiological and functional considerations. Progress in Neurobiology, 98(1), 82–98.

Hohwy, J., Roepstorff, A., & Friston, K. (2008). Predictive coding explains binocular rivalry: an epistemological review. Cognition, 108(3), 687–701.

Horga, G., Schatz, K. C., Abi-Dargham, A., & Peterson, B. S. (2014). Deficits in predictive coding underlie hallucinations in schizophrenia. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 34(24), 8072–82.

Huang, Y., & Rao, R. P. N. (2011). Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2(5), 580–593.

Körding, K. P., & Wolpert, D. M. (2006). Bayesian decision theory in sensorimotor control. Trends in Cognitive Sciences, 10(7), 319–26.

Niu, Y., Todd, R. M., & Anderson, A. K. (2012). Affective salience can reverse the effects of stimulus-driven salience on eye movements in complex scenes. Frontiers in Psychology, 3, 336.

Oliva, A., & Torralba, A. (2001). Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision, 42(3).

Rao, R., & Ballard, D. (2004). Probabilistic models of attention based on iconic representations and predictive coding. Neurobiology of Attention.

Schyns, P. G., & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5(4).

Seth, A. K., & Critchley, H. D. (2013). Extending predictive processing to the body: emotion as interoceptive inference. The Behavioral and Brain Sciences, 36(3), 227–8.

Seth, A. K., Suzuki, K., & Critchley, H. D. (2011). An interoceptive predictive coding model of conscious presence. Frontiers in Psychology, 2, 395.

Shi, Y. Q., & Sun, H. (1999). Image and video compression for multimedia engineering. CRC Press.

Spratling, M. W. (2008). Predictive coding as a model of biased competition in visual attention. Vision Research, 48(12), 1391–408.

Stefanics, G., Kremlacek, J., & Czigler, I. (2014). Visual mismatch negativity: a predictive coding view. Frontiers in Human Neuroscience, 8, 666.

Stephens, C. (2001). When Is It Selectively Advantageous to Have True Beliefs? Sandwiching the Better Safe than Sorry Argument. Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 105(2), 161–189.

Summerfield, C., & Egner, T. (2009). Expectation (and attention) in visual cognition. Trends in Cognitive Sciences.

Todd, R. M., Cunningham, W. A., Anderson, A. K., & Thompson, E. (2012). Affectbiased attention as emotion regulation. Trends in Cognitive Sciences, 16(7).

Torralba, A., & Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14(3), 391–412.

Wilkinson, S. (2014). Accounting for the phenomenology and varieties of auditory verbal hallucination within a predictive processing framework. Consciousness and Cognition, 30C, 142–155.

 

Notes

[1] Other theorists with similar views include (Rao & Ballard, 2004; Spratling, 2008; Summerfield & Egner, 2009).

Comments

  1. Introduction
    Ransom & Fazelpour describe the predictive coding (PC) approach to attention and raise three objections to it. I really enjoyed reading their paper: it sees the exciting potential of the PC framework, and works intensively with it to advance our understanding of both it and attention itself. Though the three objections are interesting and challenging, I do think they can be answered.

    Before I begin on the three objections, I will make a couple of general comments about the PC framework, which will facilitate my responses to Ransom & Fazelpour’s objections.

    It is tempting to think of an organism’s cognitive and perceptual activities in terms of resource management: the organism does what it does in order to preserve energy. Ransom & Fazelpour allude to this type of idea at some points in the paper. They also of course work with the more explicitly PC-related idea of the organism trying to minimize prediction error.

    I think it is important to keep focus on the latter notion. I don’t think organisms are geared to save energy as such. That would be a bit like saying that the organism will aim to have as low a heart rate as possible, in order to preserve resources. This would be a dangerous strategy, even if it does preserve resources in the short run. Rather, the heart rate is changed dynamically to help the organism maintain itself in its expected states. Similarly, the organism will organize itself such that in the long run it will not encounter large deviations from its expectations. This is a non-trivial task in a changing, volatile world where there is no guarantee that the state currently occupied by the agent will remain a low prediction error state.

    This is an important point because it means that occasionally the organism may occupy states with high prediction error in order to maintain itself in the longer run. The agent may also anticipate a change in precision, and therefore leave a low-prediction error state in favour of a new, relatively unexplored state. Attention in the sense of precision optimization is thus critical to help the agent’s inferences in a changing, volatile world. All this tends to be obscured by casting the organism’s goals in terms of energy preservation.

    In approximate hierarchical Bayesian inference, prediction errors for the means and precisions of probability distributions are treated separately (Mathys et al. 2014). This means that we can talk of the expected precision for a given prediction error without knowing anything about the mean (or value) of that prediction error, and vice versa. This matters because it means that sometimes precisions and values can come apart and they may in some instances work antagonistically. For example, inference may, at least in the short term, be misled by false expectations of high precisions (e.g., the expected high precision prediction error for a stimulus may direct attention to that stimulus even if it isn’t in fact precise). This predicts that there will be a varied and dynamic inferential landscape.
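
    This separation of value and precision can be sketched with a toy precision-weighted combination of two cues (not from the comment; all numbers and names are invented): a falsely trusted noisy cue drags the estimate away from the truth, just as a falsely expected high precision can misdirect inference.

    ```python
    # Toy precision-weighted cue combination (invented numbers).
    # Each cue contributes in proportion to its *expected* precision,
    # which need not match its actual reliability.

    def combine(estimates, expected_precisions):
        total = sum(expected_precisions)
        return sum(e * p for e, p in zip(estimates, expected_precisions)) / total

    true_value = 10.0
    cue_good = 10.2   # accurate but modest cue
    cue_noisy = 14.0  # unreliable cue that happens to be far off

    # Well-calibrated expectations: trust the good cue more.
    well_calibrated = combine([cue_good, cue_noisy], [9.0, 1.0])  # ~10.58
    # Falsely expecting the noisy cue to be precise misleads the inference.
    mis_calibrated = combine([cue_good, cue_noisy], [1.0, 9.0])   # ~13.62
    print(well_calibrated, mis_calibrated)
    ```

    The same arithmetic holds however the expectations are set, which is why Hohwy can say that inference may be misled "at least in the short term" while leaving the values themselves untouched.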

    Objection 1
    Ransom & Fazelpour raise the issue that whereas Posner-like spatial attention may fit well with the PC approach, it is less obvious that feature-based attention will be amenable to the predictive coding approach. I agree that feature-based attention presents a different, less straightforward challenge than spatial attention. But my view is the same as Andy Clark’s, cited by Ransom & Fazelpour, namely that feature-based attention is a matter of increasing the gain on prediction error in some feature space. For example, attention to faces should increase prediction error that feeds into the occipital and fusiform face areas, rather than error from a certain area of visual space.

    Ransom & Fazelpour argue that up-regulating prediction error for the attended feature will have the opposite effect, namely of the individual perceiving everything but the supposedly attended feature. This is how I understand their argument: Attending to a feature is increasing the gain on that feature; this increases the expectation that the feature will occur, so everything other than the attended feature will generate strong prediction error relative to that attention-driven expectation; since strong prediction error drives perception, the individual will perceive all those other things.

    I don’t think this is an objection to the PC view of attention. When the system endogenously attends to an as yet unseen feature, the precision of prediction error for that feature is expected to be high, which causes increased gain for that prediction error. Crucially, in Bayesian inference, weights must sum to one to be meaningful, so when the gain is increased on some prediction error it must decrease on others. Even if the system is surprised to see features other than the one attended to, the prediction error giving rise to this surprise is curtailed due to the pre-emptive down-weighting of those prediction errors. This allows the system to preferentially favour the attended-to feature (e.g., the lost keys).
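
    A toy numerical sketch of this sum-to-one story (not from the comment; the feature names and numbers are invented) shows how boosting the gain on the attended channel pre-emptively dampens the weighted prediction errors of the distractors, even when their raw errors are larger:

    ```python
    # Toy illustration (invented numbers): precision weights over feature
    # channels are normalized to sum to one, so boosting one channel's gain
    # necessarily lowers the gain on the others.

    def normalize(weights):
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # Baseline expected precisions for three feature channels.
    expected_precision = {"keys": 1.0, "mug": 1.0, "phone": 1.0}

    # Endogenous attention to "keys": raise its expected precision.
    gains = normalize(dict(expected_precision, keys=4.0))

    # Raw prediction errors: the distractors are actually more surprising.
    errors = {"keys": 0.5, "mug": 1.0, "phone": 1.0}

    # Precision-weighted errors: the attended channel dominates
    # despite having the smaller raw error.
    weighted = {k: gains[k] * errors[k] for k in errors}
    print(weighted)
    ```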

    There is however something new and important about Ransom & Fazelpour’s reasoning, which speaks to the phenomenology of attention. I think it is part of feature-based attention that it is very hard to sustain for very long, especially if the attended-to feature is hard to find and if there are many distractors. As anyone who has tried to spot Waldo in Where’s Waldo? books will know, it is very easy to get distracted by the non-target drawings. This may be a reflection of the antagonistic processes just described: dampening down the gain on the prediction error from distractors while nevertheless being surprised to encounter them.

    In other words, Ransom & Fazelpour’s objection highlights why maintaining attention can often be such hard work. This is important because, without these antagonistic processes at work, the PC account of attention could make attention seem too easy: just up the gain on the feature in question and then sit back and wait for it to pop out of the observed scene.

    This new facet of the PC account of attention is also interesting because it highlights how endogenous attention is a very active process, where the system needs to constantly mind the assignments of priors and precisions in order to be able to approximate Bayesian inference.

    The objection Ransom & Fazelpour raise here is thus not, in my view, damaging to the PC account of attention. Instead it in fact highlights important aspects of the phenomenology of attention, which the PC account can accommodate.

    Ransom & Fazelpour next focus on the issue of whether gist perception can help pin down the spatial region where the endogenously attended object may be. The appeal to gist makes sense as an attempted answer to their first objection. Of course, I have just argued that this objection can be answered more directly by the PC account, so one does not need a separate account of spatial location to facilitate feature-based attention. I will quickly consider gist in this context however. Ransom & Fazelpour first ask if gist perception is even a good fit for PC, and then they comment that gist does not seem to give the right perceptual grain to do the job.

    I think the case of gist is interesting in itself. There is good reason to think that it would fit into the PC mold (together with everything else). Some examples of gist may be conceived as perception based on hastily glimpsed scenes where there was no opportunity to minimize very much of the prediction error throughout the levels of the perceptual hierarchy. On this conception, gist may be perceptual inference with low precision throughout the low levels of the cortical hierarchy, and enough precision (or evidence accumulation) higher up to elevate a particular overall hypothesis for inference (e.g., summary statistics of landscapes like beaches and cityscapes but without precision for detailed sensory attributes). This account of gist fits PC: gist is just a hypothesis where there is yet to be rich optimization of the precisions of low-level priors.

    Described like this, gist would not solve the problem of providing exact spatial information about where the attended attribute would be found. However, I think that gist may guide this process to some extent. If my gisty perception is that I am looking at a grassy knoll, then I’d expect to find clovers around the knoll and not up in the sky.

    Objection 2
    The second objection concerns attention to non-perceptual mental states, in particular thought. Ransom & Fazelpour point out that the PC account does not say very much about thought, and nothing really about attention to thought. They do consider a situation where global lowering of gain on low-level prediction error leads to a global increase in top-down predictions, and thus, in a sense, engagement of thought. But they remark that mere top-down prediction is not the right target for the PC account of attention, which focuses on gain on bottom-up signals. I think Ransom & Fazelpour are nevertheless highlighting an important type of process, which may be cashed out neurophysiologically in terms of default mode networks. These networks have been associated with mental time travel and personal-level thought under the PC account (Hohwy 2007, Gerrans 2014). However, traditionally, engagement of the default mode network is often viewed as anti-correlated with networks thought to subserve attention.

    It seems to me that perhaps appeal to the cortical hierarchy could begin to answer the question about attention to thought. We may assume that thought, on the PC account, is associated with high-level, invariant hypotheses. Attention to thought could then be associated with globally lowering the gain on prediction error at low, sensory levels. Since precisions should sum to one, this would coincide with increasing the gain on higher-level prediction errors for the more invariant representations. This would be in keeping with the PC account of attention, which focuses on the bottom-up signal.

    This proposal only scratches the surface though. As Ransom & Fazelpour point out, it is far from obvious how we should even think about prediction error in the realm of thought. I think thought is the next big frontier for the PC account.

    Ransom & Fazelpour speculate that perhaps epistemic emotions are key to attention to thought. For example, the disquiet of harbouring beliefs that p and that not-p may attract attention to those thoughts. I think this is an interesting proposal, which might relate to some psychopathologies, such as depression, which is often marked by rumination, or aspects of psychosis, with its associated inability to let go of distressing delusional thoughts.

    Objection 3
    Ransom & Fazelpour first highlight cases where a stimulus occurs infrequently, but is highly salient. The problem for the PC account is, they argue, that such a stimulus – such as a rarely encountered big barking dog – attracts attention but not because the signal associated with it is precise (since the dog occurs only infrequently).

    The first thing to say is that attention to the stimulus itself, when it occurs, is easy to explain, since salient stimuli (big dogs barking) will have high precision (cf. signal strength). But the case here concerns attention to the suspected location of the dog when it is not barking: even if I know it is not barking very often I always attend to its location when I walk past. Ransom & Fazelpour then tie this issue to the question of how agent-level preferences and statistical biases factor into the PC account of attention.

    There is much packed into these issues, because they touch on the wider notion of active inference under the predictive processing framework (or free energy principle (Friston 2010, Hohwy 2015)). There are a number of ways to go in beginning to accommodate these issues within this wider perspective.

    It is worth noting that if the dog is known with high confidence to only occur on every 20th day, then there is no reason to think one would attend to the location on the other 19 days. If there is much uncertainty about the variability in the dog’s occurrence, on the other hand, then it makes sense to keep a keen eye on its suspected location. In other words, volatility (unknown factors implicated in the whereabouts of the dog) plays a role for allocation of attention, which is just to say that precision optimization needs to be hierarchical. This is consistent with our PC account of attention.

    On the predictive processing account, agent-level preferences are cashed out in terms of the agent’s belief that it will occupy low prediction error states in the long-term average. The agent needs to learn which states these are, that is, learn which states can be expected to have high and low prediction error. The agent also needs to learn policies for how to obtain or avoid these states.

    Part of this learning process concerns the evolution of the magnitude and precision of prediction error over time, as the agent stays in a certain state or moves to the next one. That is, the agent learns how its own actions interact with the evolution of reward. One thing the agent might learn is that false negatives can have a high cost. For example, falsely believing of a mad dog that it is not there. It may therefore be biased towards false positives. For example, staying away from the dog area even if it is not there.

    However, it is clear that a mere bias against false negatives gives an oversimplified picture of agent-level preferences. For example, learning may be curtailed to your longer term detriment if you never explore the dog’s true whereabouts. Indeed, if false positives are not checked, one may populate one’s internal model with spurious hidden causes (e.g., a dangerous dog in every back yard).

    I think this relates to the exploitation-exploration balance, which any organism must somehow optimize. Minimizing prediction error under a known, high-confidence model corresponds to exploiting its resources. This is fine to a certain extent but will begin to misfire when resources are low (that is, the precisions of prediction errors will begin to decrease and the evidence accumulated for the model will deteriorate). Before this happens, the agent should leave its current state and go explore – even if it may encounter big dogs on the way.

    This is where precisions come in, because the next state for the agent to visit will be partly determined by expected precisions. The agent should seek out the most precise among the many states the agent could visit, which are reasonably close (in probabilistic terms) to the agent’s expected state.

    This means that action and precisions are closely related: we act to visit states with high expected precision. This in turn means that action and attention are related. In fact, under the free energy principle, one can say that action is an attentional phenomenon. This is because action occurs when prediction error is minimized by seeking out new, precise evidence while decreasing the gain on – or withdrawing attention from – the current prediction error. This is a challenging idea but is core to how the system swaps between perceptual and active inference. Empirically, it is reflected in our inability to tickle ourselves, and our tendency to escalate force rather than matching it (Brown et al. 2013, Van Doorn et al. 2015).

    Concluding remarks
    I have considered Ransom & Fazelpour’s three objections to the predictive processing account of attention. In each case, I argued that the account has enough resources to answer the objections. However, all the objections very fruitfully point to important issues and areas where more research is needed: the phenomenology of attention, the probabilistic nature of thought, and the necessity of basing the predictive processing account on the wider, more agent-based notion of the free energy principle.

     

    References

    Brown, H., et al. (2013). Active inference, sensory attenuation and illusions. Cognitive Processing 14(4): 411-427.
    Friston, K. (2010). The free-energy principle: A unified brain theory? Nat Rev Neurosci 11(2): 127-138.
    Gerrans, P. (2014). The measure of madness: Philosophy of mind, cognitive neuroscience, and delusional thought. Cambridge, Mass., MIT Press.
    Hohwy, J. (2007). The sense of self in the phenomenology of agency and perception. Psyche 13(1).
    Hohwy, J. (2015). The neural organ explains the mind. Open mind. T. Metzinger et al. Frankfurt am Main, MIND Group: 1-23.
    Mathys, C. D., et al. (2014). Uncertainty in perception and the hierarchical gaussian filter. Frontiers in Human Neuroscience 8.
    Van Doorn, G., B. Paton, J. Howell and J. Hohwy (2015). Attenuated self-tickle sensation even under trajectory perturbation. Consciousness and Cognition 36(0): 147-153.

    1. Hi Jakob,

      Thanks so much for your kind comments. As you mentioned, the predictive coding framework is very resourceful (no doubt partially thanks to the sophistication of the statistical tools at its disposal). So it’d be great to explore some directions for further research where the mathematical flexibility of the framework could result in concrete models of mental phenomena.

      To this end, here are some thoughts on your comments about our second point (regarding attention to thought). It’d be wonderful to explore these further with you:

      (1) Organizational basis: in hierarchical Bayesian/generative models (HBMs or HGMs) hypothesis spaces are hierarchically structured, typically, on the basis of their level of abstractness, such that higher levels of the hierarchy are taken to entertain expectations about more invariant spatiotemporal regularities. For this reason, it makes sense for concepts, insofar as they are more invariant than percepts, to belong to these higher levels of HGMs. Nonetheless, one question that may be worth considering is whether all concepts (and resulting thoughts) stand in this sort of relation to percepts, i.e., whether there are not concepts that could encode expectations about faster, more variable spatiotemporal patterns, and, if so, where in the hierarchy they would fit.

      (2) Function of thought(s): in hypothesis testing, it is generally the case that when a system is sequestered from new evidence it gets involved in model selection; in these cases, the system favors the model with the smaller number of redundant variables, thus minimizing the complexity of its models (though this depends on the inter-level priors and the type of stored evidence employed). And in their application of the predictive coding framework to dreaming, Hobson and Friston (2012, 2014) assert that this is exactly what any system (including the brain) does when decoupled from sensory perturbations, presumably at least on average in the longer run (as a side note, this is quite similar to Crick and Mitchison’s 1983 proposal about the “reverse learning” function of REM sleep, which was driven by considerations about the economy of connections in neural networks). Now, other than thoughts that are clearly about current perceptual events, most imaginative and cognitive episodes appear to be unconstrained by sensory input and motor output in just this sort of way (e.g., dreaming, mind wandering, rumination, and even directed, effortful thoughts about counterfactuals, plans, etc.). While it is generally agreed that acquiring sparse representations, which could result from complexity minimization, is very advantageous to an organism, it is not quite clear whether all these imaginative and cognitive episodes do in fact subserve the same function of complexity minimization. One interesting question, therefore, is how, employing the predictive coding framework with its use of HGMs, we can draw fine-grained distinctions about the role of these episodes in mental life.

      (3) Determinants of cognitive attention: (my main point here has already been mentioned in a reply to Carolyn’s comment) according to the predictive coding framework, attention is driven by and selectively distributed on the basis of expected precision of bottom-up prediction errors. And so the same story must go for attention to cognitive and imaginative episodes. My worry here is basically the same as in the case of endogenous, feature-based attention: if we spell out all the mechanisms (e.g., affective ones) that could set the attentional gain (e.g., that negative emotions could direct and sustain attention in episodes of depressive rumination), then what exactly is the explanatory role played by the notion of “expected precision”? Does it not turn out to be redundant?

      (4) Phenomenology of thought and imagination: finally, it’d be nice to think what sort of cognitive and imaginative experiences we should have, were the predictive coding accounts correct (e.g., if it were correct that thoughts are driven by an inferential process, that the result of the inference at these higher levels was complexity minimization, etc.). Would we get the sort of thematic coherence and relative narrative stability that we often encounter in these experiences, especially in the absence of the sensorimotor constraints that ensure the coherence and stability of experiences resulting from perceptual and active inference?

      1. These are interesting questions – there is lots to discuss. I will be brief however so as not to hijack this discussion too much.
        1. Jona Vance has raised similar questions. I am not yet sure how to go here. One option I like is to say that of course there can be high-level thought about low-level sensory attributes but that this differs from perception because there are still high levels of imprecision in the short term predictions for short term regularities. I can have beliefs about the spatiotemporal whereabouts of the fast-moving ants but they are not very precise.
        2. Nice discussion here. I think complexity reduction is one thing that might happen in the absence of input. The other thing that might happen is counterfactual processing (of the sort Anil Seth has written about), which may lead to many different outcomes. For example, as I run through a counterfactual I might end up with a completely new, creative scenario.
        3. I have left some comments on this elsewhere (below I think). Oh, and, as everyone knows, emotional valence is associated with the rate of change of free energy over time ☺
        4. Probably coherence would suffer, over time, as the system would presumably drift and become delusional if deprived of input over time (cf., perhaps, prisoners in isolation). But if thoughts are related to active inference, and to testing through action and in perception, then in the medium term I don’t see why it shouldn’t fit pretty well to our normal experience of being engaged in thought. (This seems to potentially relate to the interesting work by Jenny Windt and Thomas Metzinger on mindwandering).

  2. I am grateful to the organizers of this conference (especially Brett Castellanos) for inviting me to comment on this rich and provocative paper. I look forward to reading Jakob Hohwy’s defense of the predictive coding theory of attention (PCTa) as well as engaging in dialogue with the authors about the ideas presented in this paper. In short: Although I do not personally endorse PCTa as a theory of attention, I think that PCTa may have the resources to defend itself against at least the first and third challenges posed by Ransom and Fazelpour, and I invite them to further develop these challenges in response to the ideas I offer below.

    PCTa provides a role for attention within a Bayesian framework. Bayes’ theorem provides a way of building knowledge: we can establish the (posterior) probability of the truth of something, given some evidence, from 1) the (prior) probability of its truth, before gathering evidence, 2) the probability that the evidence would occur given its truth, and 3) the probability that such evidence would occur at all. In a Bayesian framework, we assume that Bayes’ theorem is being applied by the brain. If we take the brain’s primary role to be that of generating and testing hypotheses, we can extend this framework to the entire brain. In that case, we might think of the brain this way: top-down signals, or “hypotheses,” are generated and then compared to bottom-up signals, or “prediction errors,” to improve future top-down signals/hypotheses (and this can occur both within and across different levels of processing). But since all prediction errors are not created equal, the job of attention is to direct processing resources to the most important prediction errors. PCTa holds that the direction of attention is determined by expected precision—the more precise a prediction error is expected to be, the more processing resources are provided by attention.
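
    This gloss on Bayes’ theorem can be made concrete with a toy update (the hypothesis and all numbers are invented purely for illustration):

    ```python
    # Toy Bayesian update (invented numbers):
    # posterior = likelihood * prior / evidence.
    prior_h = 0.2           # P(H): prior probability the hypothesis is true
    likelihood = 0.9        # P(E | H): probability of the evidence, given H
    likelihood_not_h = 0.3  # P(E | not-H)

    # P(E): the probability that such evidence would occur at all.
    evidence = likelihood * prior_h + likelihood_not_h * (1 - prior_h)

    posterior = likelihood * prior_h / evidence
    print(round(posterior, 3))  # 0.429: the evidence raises P(H) from 0.2
    ```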

    As a side note, there is at least one assumption that PCTa makes about the brain that some readers are likely to reject: as Hohwy puts it, “for the special case of the brain’s attempt to represent the world…it only has the effects to go by so must somehow begin the representational task de novo” (Hohwy 2012, 75). Against the view that the brain represents the world, some readers may find this to be true of, at most, the mind: while some theorists reject representationalism altogether, others may accept representationalism for the mind, but ground mental representations in the interaction between a brain and its environment, rather than the brain by itself. It would be worthwhile for those who do not share this particular representationalist framework to explore what aspects of PCTa can be salvaged in their view. Having flagged it, I will set aside this concern for the remainder of the commentary.

    The first problem raised by Ransom and Fazelpour has to do with visual search, which they couch in terms of “feature-based endogenous attention”: What sorts of precision hypotheses could allow for feature-based endogenous attention in PCTa? Recall that attention is assigned based on expected precision. One possibility (A) is that attention is assigned and precision expectations are high if and only if an item/feature in the visual field is the desired item/feature. Another possibility (B) is that attention is assigned and precision expectations are high if and only if an item/feature in the visual field is not the desired item/feature. In either case, we will only get a signal if there is an error, or if the expectation turns out to be wrong. So we will only get a signal in A if an item/feature in the visual field is not the desired item/feature, and we will only get a signal in B if an item/feature is the desired item/feature. So in A we can’t get a signal for the desired item at all. But B is not a plausible description of the case. So PCTa is in trouble.

    Yet I think PCTa has the resources to answer this problem. It need only posit that the brain makes both sorts of hypotheses, A and B: this is the desired item, this is not the desired item. In the process of visual search, we would expect a higher error rate for hypothesis A errors than hypothesis B errors—we are likely to get an error almost every time we make hypothesis A, but very rarely when we make hypothesis B. This means that we would expect the precision of the hypothesis B errors to be higher than the hypothesis A errors.
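
    The asymmetry in error rates can be sketched with a toy visual-search scene (the scene composition is invented: one target among 100 items): hypothesis A errs on nearly every item, hypothesis B only at the target, so B’s errors are the rarer and, on this suggestion, the more precisely expected signal.

    ```python
    # Toy visual-search scene (invented): 1 target among 100 items.
    scene = ["target"] + ["distractor"] * 99

    # Hypothesis A ("this item is the target") errs whenever the item is not.
    errors_a = sum(1 for item in scene if item != "target")
    # Hypothesis B ("this item is not the target") errs only at the target.
    errors_b = sum(1 for item in scene if item == "target")

    print(errors_a, errors_b)  # 99 1: A's errors are routine, B's are rare
    ```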

    Exploring the tension between competing hypotheses might also help us to answer Ransom and Fazelpour’s third problem for PCTa. In this problem, the authors discuss the case of fear, such as the fear of a Doberman that is rarely in someone’s yard, which, the authors think, causes one to attend to the yard each time one passes. Since in PCTa attention occurs when we predict signals to be reliable, the authors hold that it should not occur in a case such as this, where the phenomenon is not reliable, but rare. Yet the surprise of the Doberman in the first case is due to the fact that the Doberman is reliably absent. Perhaps we could say that one’s hypothesis that “this block is safe” is a reliable hypothesis, and so the presence of the Doberman generates more attention than it would if one were, say, at a dog shelter. Once one’s reliable hypothesis is challenged, it makes sense that one would begin to test a competing hypothesis—“there is danger here.” For PCTa it might be better to say that fear (or the strength of the initial error) causes one to generate a strong competing “there is danger here” hypothesis for a time, in the context of the error (while walking on that block), rather than that it causes one to attend to the outcome of a faulty “there is danger here” hypothesis.

    One concern I have with these potential responses to Ransom and Fazelpour is that they reveal a remaining question about PCTa: how does PCTa separate attention from other sources of signal strength? It is usual to speak of attention anytime one signal is more prominent than others, but attention is not the only source of signal prominence. While I know this to be true of PCTa, I am unclear on how to account for differences in signal strength that do not come from attention. What are the sources of these differences and how do we distinguish them from attention in cases like this? Answers to these questions would help me to better assess PCTa with respect to the challenges posed by Ransom and Fazelpour.

  3. Hi Carolyn,
    These are interesting comments. I agree that the PC account is resourceful! I look forward to what Ransom & Fazelpour might say to your comments.

    I like the idea of competing hypotheses, which is I think consistent with some of the things I suggest in terms of regulation of weights. In general, model selection will be a core part of the PC account.

    As to the last comment, I think you’re right that it is not obvious how the precision optimization story matches up with the functional role for attention. I like the prospect for revisionism about attention. It may be a broader notion than we normally think. For example, action comes out as an essentially attentional phenomenon since action is driven by allocation of resources in the light of expected precisions.

    But it may also help to say that endogenous attention is top-down gating of prediction error determined by expected precisions, so it is insensitive to other aspects of signal strength. Exogenous attention on the other hand is, as a first approximation, a matter of signal strength of the current sensory input (where the idea is that signal strength goes with an expectation for precision). I do think that in both cases, attention comes out as a broader notion than most would initially think.

    1. Thanks for that, Jakob. If I understand you right, you are saying:
      –endogenous attention increases signal strength for those prediction errors with high expected precision
      –exogenous attention just is the relative signal strength for prediction errors, or the relative size of prediction error
      Is that right? Then it might not make sense to say that exogenous attention increases signal strength, since it just is relative signal strength. In that case, would you be willing to say more about how PCTa separates exogenous attention from perception?

        1. Carolyn and Jakob, thanks for the very interesting discussion.

        It seems to me that, according to the predictive coding picture, in both exogenous and endogenous cases, attention is driven by and selectively distributed on the basis of expected precision of (bottom-up) prediction errors. Now, these expectations (about the variability of the signals and whether the source of that variability is due to the signal content as opposed to the contextual noise) can themselves be estimated using different learned regularities:

        In typical exogenous cases, the expected precisions seem to be estimated on the basis of changes in the current signal (prediction error) itself; in the context of an environment where the variability due to noise is rather slow, sudden variations in the signal may be attributed to the signal content rather than the contextual noise (e.g., a sudden movement or a loud noise). So the modulation is, in a sense, lateral, taking place within the units computing the error/mismatch at the same level of the hierarchy.

        In endogenous cases, on the other hand, the expected precisions seem to be estimated on the basis of the hypotheses entertained at the higher levels of the hierarchy, and so the modulation is top-down (e.g., the more abstract hypothesis that it is nighttime may yield higher expected precision for the signals delivered through the auditory channel rather than those delivered through the visual pathway).

        Now here is a worry that I have: in cases of endogenous attention, a subject’s attention seems to guide her search towards “task-relevant” features, which according to PC means that such features must enjoy higher relative expected precision. Yet it is not entirely clear whether the notion of “task-relevance” can be reduced to that of “expected precision” (if precision is used in the sense of inverse variance). Surely we can think of many examples where whether some information is task-relevant or not is orthogonal to whether it is noisy or even expected to be so. As Jakob nicely puts it, in endogenous cases, the top-down gating is “determined by expected precisions, so it is insensitive to other aspects of signal strength”. But is the gating even “determined by expected precisions”? If we have these top-down influences that determine what is relevant to the task and set the gain accordingly, what is the explanatory role played by the notion of expected precision in guiding the search? It seems that most, if not all, of the explanatory work is done by other top-down mechanisms, yet to be spelled out by PC, that set the gain; hence, “expected precision” seems at best explanatorily inadequate and at worst explanatorily redundant when it comes to modulating attention in these cases.

      2. I think the jury is still out on exactly how to treat exogenous attention in the predictive coding framework, so it is to some degree a case of ‘watch this space’. But on the way I think of it, strong signals are amplified, and this is exogenous attention. They are already strong and get a further boost.

      3. Sina, you’re right to ask about endogenous attention. The account is designed in the first instance to apply to the not quite folksy notion of attention in Posner type designs. Some more moves are needed to make it more akin to ‘task relevance’ in a more common sense. I think this can be done (some moves are in the book, ahem). I think there is probably a regularity such that the tasks we undertake tend to be associated with some precision in the stimuli we engage with. I also think there is an element of active inference in endogenous and certainly voluntary attention, such that we expect our actions to increase precision (focusing the light to better count the beans).

  4. Reply to Jakob and Carolyn on the first problem

    First we would like to thank both Jakob Hohwy and Carolyn Jennings for providing us with insightful and challenging commentaries, and giving us the opportunity to further engage with predictive coding.

    With the creation of a new model within a theoretical framework, the devil is often in the details. What are the specific predictions that the model makes? How would a real life example work? This is where we have tried to press the PC model of attention with respect to endogenous feature-based attention. Both Jakob and Carolyn argue that the PC account has the resources to handle the issue, but we remain unclear on the details of how their proposals are supposed to work, and so we would like to invite them to dive into the details with us in order to fill out the account of endogenous attention more fully.

    On the PC model, while precision expectations can come apart from perceptual hypotheses, they must act through them in order to increase or decrease the gain. Prediction error is hypothesis bound, and so attentional gain must be hypothesis bound. There is no such thing as free floating prediction error, as it is always generated via a mismatch between hypothesis and the signal from the world that impinges on our senses at the lowest levels of the hierarchy.

    As we see it, this fact is the root of the problem we raised for the PC account in our paper. On bottom up theories or mixed top-down bottom-up theories of perceptual processing, there can be free floating signal from our sensory modalities that is then selected for attention, and so the same problems don’t arise. The fact that gain must operate on prediction error, and that this is always hypothesis bound, cuts off an intuitive way of thinking about feature based attention. One might think of humans as possessing color or feature detectors, with precision expectations working to turn up or turn down the deliverances of these detectors – turning up the gain on the red detector makes it easier to detect red, and so on. But this picture is misleading, as it is not a bottom up detection signal that alerts us to the presence of red stripes and so drives our attention. Rather, on the PC model it is the prediction error, and prediction error is generated when there is a mismatch between hypothesis and signal.
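
    To make the hypothesis-bound character of prediction error concrete, here is a minimal toy sketch in Python (our own illustration with made-up numbers, not a claim about any specific PC implementation; the gain formula is just the standard precision-weighting of a Gaussian Bayesian update):

    ```python
    def prediction_error(hypothesis_mean, signal):
        """Prediction error exists only relative to a hypothesis: it is
        the mismatch between predicted and actual signal."""
        return signal - hypothesis_mean

    def precision_weighted_update(prior_mean, prior_precision,
                                  signal, expected_precision):
        """One step of precision-weighted hypothesis revision: expected
        precision (inverse variance) acts as gain on the error."""
        error = prediction_error(prior_mean, signal)
        gain = expected_precision / (prior_precision + expected_precision)
        return prior_mean + gain * error

    # A correct hypothesis generates no error, so there is nothing
    # for attentional gain to amplify...
    print(prediction_error(hypothesis_mean=1.0, signal=1.0))  # 0.0
    # ...while a mismatching signal generates a large error.
    print(prediction_error(hypothesis_mean=1.0, signal=5.0))  # 4.0

    # Raising expected precision (attention) makes the very same error
    # move the hypothesis further.
    low = precision_weighted_update(1.0, 1.0, 5.0, expected_precision=0.25)
    high = precision_weighted_update(1.0, 1.0, 5.0, expected_precision=4.0)
    print(low < high)  # True: more expected precision, bigger revision
    ```

    The point to notice is that nothing here is free floating: the error, and hence the gain, is defined only relative to what the hypothesis predicts.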

    So it is natural to turn to the perceptual hypotheses and ask – what must these be like in order to properly model actual attentional phenomena? Strictly speaking, perceptual hypotheses are probability density functions, on this account. But below we consider more intuitive ways of thinking about the hypotheses, in part because we want to engage our readers, but also because such functions must ultimately be translatable into something like the hypotheses below on the PC model in order to tell a coherent story about thought. (On the PC model there is no sharp distinction between perception and cognition, just a hierarchy that moves from invariant hypotheses at the very top to progressively more variant hypotheses as predictions make their way down.)

    In the spatial case, there are several options. Using the Posner paradigm again (see video or paper), let’s call the location that the arrow points to, and where the target stimulus will appear, location x. One might posit that the hypothesis is something like ‘location x is white space’, and the precision expectation turns up the gain on the visual component of this hypothesis. Of course, there is little to no prediction error being generated before the stimulus appears, because the hypothesis is correct. But when the star appears, then the hypothesis will no longer be correct, and a large prediction error, amplified by attentional gain, will be generated.

    Alternately and more speculatively, perhaps the hypothesis at time t is ‘an object will appear at location x at time t+1’. The region is expected to provide highly precise prediction error in the near future, and so perhaps in preparation the gain is turned up on the current hypothesis associated with location x – ‘location x is white space’. Again, there is little to no prediction error generated by this hypothesis currently, but upping the gain allows us to attend to this region, and when the target stimulus appears, though the hypothesis predicts that an object will appear (and perhaps even which object) and so the prediction error would be otherwise small, prediction error is amplified because it is expected to be highly precise.

    Based on these examples, what can we say about feature-based attention? At time t when we are shown the featural cue (Waldo’s red and white stripes, see video), what hypothesis is generated? Jakob writes that his view of endogenous feature-based attention is that it is “a matter of increasing the gain on prediction error in some feature space. For example, attention to faces should increase prediction error that feeds into the occipital and fusiform face areas, rather than error from a certain area of visual space” (commentary on this blog).

    But here one may wonder whether separating space and features is realistic. Expected features can only generate prediction error when the featural hypothesis is applied to a spatial location; otherwise it remains mysterious how prediction error is generated. Suppose I predict there will be red and white stripes but fail to provide any location: then it is unclear how a prediction error could be generated, for since I have not really applied the hypothesis to the scene, my sensory modalities don’t have enough to go on to provide an error. A useful analogy here might be as follows: I tell you that it will rain, but don’t say where (assume the information about when is specified to some degree). Your job is now to tell me if I’m right or wrong. To do so, you need to go outside and start moving around. Is it raining at location x? No. At location y? No. And so on, until you have been to all the locations in the relevant space (or to a location where it is indeed raining).

    But you can’t provide me any feedback until you have gone to a location. If I have a hundred people working to check the various locations simultaneously then I can get the feedback more quickly, but they too must go out into the world and see what’s going on. Features exist in space, just as rain falls in specific places. Even if you were a psychic able to confirm my hypothesis without looking out into the world, this still would not provide me with information about where it is raining, because this was not part of my hypothesis. Such information is useless to me, except to raise the prior probability of my hypothesis. (Note that our point that the featural hypothesis ‘there are red stripes’ must operate through spatial hypotheses is compatible with its not being tied to any specific location at higher levels of the hierarchy, and this seems the intuitive way of thinking about it – we are more sure that red stripes will occur than of where they will occur in this case.)

    So, if the picture we have outlined up to here is right, then when the spatial location for the expected featural hypothesis is unknown (as in the case of feature-based attention), either the hypothesis cannot be applied selectively to the scene, or a guess must be made. Let’s run through both options to see how they fare.

    The first option is to apply the featural hypothesis to the entire scene. Imagine perceptual hypotheses as operating on a spatial grid that spans our entire visual range. Each square of the grid then tests the same perceptual hypothesis, something in the neighborhood of ‘this is red striped’. However, this just leads us back to the same problem: things that aren’t red striped will generate large prediction errors, and things that are red striped will generate smaller prediction errors. Imagine that in location x there are red stripes, and in location y there is a yellow patch. The yellow patch will generate a larger prediction error than the red stripes, and so even though both are amplified, the yellow patch should grab our attention over the stripes – the opposite of what we see in feature-based attention.
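
    The worry can be put numerically. In the following toy sketch (our construction; the feature codes and the gain value are arbitrary stand-ins), the hypothesis ‘this is red striped’ is tested at every grid square and the same gain amplifies every resulting error, so amplification preserves the ordering and the mismatching yellow patch still wins:

    ```python
    # Toy grid: each square holds a feature code; 1.0 stands for
    # 'red striped', other values for other features (arbitrary codes).
    scene = {"x": 1.0,   # red stripes at location x
             "y": 4.0,   # yellow patch at location y
             "z": 1.2}   # a near-match elsewhere

    HYPOTHESIS = 1.0     # 'this is red striped', applied everywhere
    GAIN = 3.0           # uniform attentional gain across the grid

    amplified_errors = {loc: GAIN * abs(value - HYPOTHESIS)
                        for loc, value in scene.items()}

    # Attention goes to the largest amplified error -- the yellow
    # patch, not the red stripes: the inverse of what endogenous
    # feature-based attention requires.
    winner = max(amplified_errors, key=amplified_errors.get)
    print(winner)  # prints: y
    ```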

    The second option is to apply the featural hypothesis to a given location based on a guess. The location might just be chosen completely at random, or it might be informed by other expectations (for example, as Hohwy suggests, if we are searching for a four-leafed clover, gist perception may be able to narrow down the location to the grassy knoll, and our perceptual system won’t waste time applying the featural hypothesis to the sky or the lake, say. But here, once the location has been narrowed down, the more exact location must still be guessed). So the hypothesis for which we expect high precision will be something like ‘there are red stripes at location x’. Then, it gets applied to location x. If there aren’t red stripes, then a large prediction error is generated, and the hypothesis will be revised on this basis, perhaps to ‘there are red stripes at location y’ and so on until the correct location is discovered.

    However, there are two things to notice here. First, because the hypothesis is expected to be highly precise, we should be attending to all these locations where there are no red stripes in turn. But this is not what happens in many instances of feature-based attention. So this proposed solution fails to capture the phenomenology of attention. Second, on this picture, feature based attention is a function of spatial attention – we are able to attend to features by performing a spatial search. But this clashes with empirical findings that feature based and spatial attention can come apart. So this option won’t work as it stands.

    Finally, Carolyn proposes to resolve the problem by holding that our perceptual hypotheses employ two simultaneous competing hypotheses: hypothesis A ‘this is red striped’ and hypothesis B ‘this is not red striped’. She goes on to suggest that “in the process of visual search, we would expect a higher error rate for hypothesis A errors than hypothesis B errors – we are likely to get an error almost every time we make hypothesis A, but very rarely when we make hypothesis B. This means that we would expect the precision of the hypothesis B errors to be higher than the hypothesis A errors” (commentary).

    If we accept this solution then we must either take on board the commitment that we now have high expected precision for both the feature and the absence of the feature (that is, for the prediction error generated by featural hypothesis A or B), or we must deny that featural cues work to increase the expected precision for the prediction error generated by that featural hypothesis. Rather, they would work to decrease (or perhaps leave untouched) our expected precisions for prediction errors associated with that feature. This strikes us as rather implausible, as it now contrasts with the spatial case: there the endogenous cue works to increase expected precision for the prediction error of the positive hypothesis, but in the featural case it would work to decrease it or leave it untouched. This makes the fix appear ad hoc. Moreover, given that precision expectations are sensitive to context, while it may be true that in the long run hypothesis A is much more likely to be false than hypothesis B, in this context one would expect a shift towards heightened precision expectations for the prediction errors generated by hypothesis A, given that one has just been shown a valid cue.

    However, rather than accepting the suggestion as it stands, we suspect that it is prior probability, and not expected precision, that will be affected by the long-term success of each hypothesis. That is, we would assign a higher prior probability to hypothesis B than to A based on the learned pattern that few things are red striped, rather than increase expected precision, given that such learning deals with the content of the hypothesis. The prior probability of a hypothesis differs from expected precision in that one might have a perceptual hypothesis with a very high prior probability but low expected precision, or vice versa.

    Indeed, we are not sure that it makes sense to talk about the expected precision of a hypothesis per se (except as a form of shorthand), rather than that of the sensory modalities that undergird the hierarchy. When it’s dark out, all the prediction errors produced by the visual component of our perceptual hypotheses become less precise (though there may be differences due to distance or uneven illumination). Each hypothesis is affected by the low expected precision.

    So hypotheses are expected to be imprecise or precise only derivatively, because they make sensory predictions. The expected precision for hypotheses may then differ insofar as they make predictions that differ in location and the extent of their reliance on different sensory modalities (a hypothesis that made predictions about something far away would be expected to generate less precise prediction errors than a hypothesis that made predictions about something up close, given that we expect more detailed visual information from things up close than far away).

    Keeping this in mind, in the ‘Where’s Waldo’ case color vision is the most obvious candidate for high expected precision, given that both hypothesis A and B make predictions about color. Expecting high precision from all prediction error generated by color won’t do, as this will amplify prediction error for both A and B equally. In location x with the red stripes, hypothesis A will generate a small prediction error that gets amplified, and B will generate a large prediction error that gets amplified (if it makes sense to speak of gain working on competing hypotheses at all, see below). Indeed, the whole scene will be attended to some degree, because the gain is turned up on all prediction error generated by color (and the hypotheses are not yet applied selectively to the scene, but rather to each square on the grid in an attempt to locate the red stripes).

    Expecting high precision from only prediction errors concerning the color red will make it such that hypothesis A will generate a small prediction error amplified by gain and hypothesis B will generate a large prediction error that doesn’t get amplified. In this case it’s unclear which hypothesis will win out (and in the reverse case where we expect high precision from all colors that aren’t red). While neither of these issues may appear dire for the PC theory at first, we think there is a deeper problem here about the instability of this predicament.

    The instability lies in the fact that there are two hypotheses competing to accurately describe or predict the features in the same location. If the gain on one hypothesis is turned down, and the gain on the other is turned up, then we have a problem – attention is both repelled and driven to that spot. We can’t make sense of this. Moreover, we need to recall that the perceptual hierarchy is trying to optimally minimize prediction error. But the hypothesis for location x that better minimizes such error is hypothesis A, and not hypothesis B. Indeed, it is the correct perceptual hypothesis. So if the prediction error on hypothesis B were amplified as precise, and that of hypothesis A were downgraded as imprecise, then this would potentially result in the wrong perceptual hypothesis winning out – and an uninformative one at that, given that it only tells us that there are no red stripes at this location and provides no positive perceptual content!

    In addition, it’s unclear whether gain operates on competing hypotheses at all. The PC theory holds that gain must sum to one in order for it to meaningfully drive attention, but given the presumably large number of competing hypotheses at each location, this seems like it would lead to a computationally intractable disaster, along with the aforementioned problem of both driving and repelling attention to a given location and the problem of bolstering the wrong hypotheses.

    It is therefore our suggestion that gain is best thought of along the grid model we’ve been using here: gain is increased or decreased for prediction errors generated via the winning hypotheses that inhabit the grid only. If this isn’t what PC theorists have in mind, then it would be good to know! It’s worth noting that if this isn’t the picture they have in mind, then it seems to complicate the claim that gain is an integral part of attention. This is because the competition process isn’t part of our perceptual experience – only the end result. It would be at the very least strange to posit that we could attend to multiple competing hypotheses that haven’t yet finalized the contents of perception.

    Finally, in order to accept the proposed solution we must put aside substantive worries about whether negative perceptual hypotheses are realizable by a perceptual system or feasible on the Bayesian PC account. While the hypothesis space must be exhaustive, instead of A and ~A (where we take ~A as equivalent to hypothesis B here) this would instead take the form of A v B v C v D etc. In this case, it would be something like ‘red’, ‘yellow’ etc. or more intractably ‘red stripes’ ‘red polka dots’ ‘solid red’ ‘yellow stripes’ ‘yellow polka dots’ etc. Reformulating the suggestion in these terms will complicate the issue of how the priors are assigned.

    At this point, one might wonder if we have anything positive to offer. In the spirit of not being curmudgeonly sticks in the mud, here is a highly tentative suggestion for how feature-based attention might work on the PC model. Perhaps endogenous feature based attention works primarily through increasing the prior probability for the relevant featural hypotheses. To see this, consider that on the PC model our perceptual systems are constantly attempting to predict the scene at hand and how it will change, but that this is a really hard task – the world is constantly changing in ways that we can’t entirely predict, and the number of candidate perceptual hypotheses is staggeringly large. Hierarchical perceptual inference is an iterative process – top down hypotheses are revised and revised until prediction error is minimized, and only then do we perceive. Presumably, the fewer the number of iterations the more rapidly perception occurs. So, when we expect red and white stripes to occur in a scene, this can reduce the number of revisions required to our perceptual hypotheses, for these features at least.

    Perhaps things work like this: the hypothesis ‘there are red stripes here’ gets applied indiscriminately to the scene, but without upping the gain anywhere. This will result in fewer to no revisions in the places where there are stripes, and more revisions in the places where there aren’t stripes. Does this mean we perceive the stripes slightly before we perceive the other items in the scene, and this might account for their salience? We’re not sure that this is the correct picture. What is certain is that it wouldn’t be congruent with the picture of attention as simply the process of precision optimization, as now we are adding that priors do some of the work too. (Though, maybe this isn’t a terrible concession for the PC theorist to make, as it doesn’t require positing extra machinery from outside the theory.)

    Or perhaps instead what happens is that precision expectations are able to exploit the fact that such processing takes less time, and so they are able to up the gain on the small remaining prediction errors for the red striped features while the rest of the processing is still going on. This gets us what we need as the featural hypotheses are in place in their proper spatial locations before gain kicks in and we experience the pop out of these features. While this may be on the way to the correct picture of how feature-based attention is supposed to go, much work remains to refine this picture and consider whether it can accommodate the empirical findings on endogenous feature-based attention. Thanks for reading!

    1. I think it might be worthwhile focusing on what happens for just one hypothesis in the kinds of cases we’ve been discussing. This might better address Madeleine and Sina’s objection. As I hinted in the commentary, there are antagonistic processes going on for attention. In the Posner paradigm, for example, a cue-left will both increase the probability that the target dot will appear there and it will increase the gain on any prediction error from that region. High expectation implies a precise prior, which in turn means that the learning rate – or weight – is decreased on the prediction error. This means that true positives are facilitated since it won’t take much sensory sampling to be sure the target is there when it is; but it also increases the risk of false positives since even a relatively large prediction error (target absent) will also be given a lesser weight, leaving the prior unchanged. At the same time there is an increase of the weight on the prediction error, which somewhat reverses the expectation-induced suppression. This makes Bayesian inference more accurate since it makes it more likely the system will correctly detect present targets, and correctly reject cases where the target is absent. Attention thus helps ensure that all and only the target is swiftly detected. This is not a bad fit for a functional role for attention. I don’t think there is any problem here about being overly biased to not detect the target.
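
      The two antagonistic effects just described can be put in a toy calculation (an illustrative sketch of the verbal description above, with made-up numbers, not a claim about implementation):

      ```python
      def weight_on_error(prior_precision, error_precision):
          """Learning rate: how much one prediction error moves the prior."""
          return error_precision / (prior_precision + error_precision)

      baseline = weight_on_error(prior_precision=1.0, error_precision=1.0)
      # Cue-left: high expectation means a precise prior, so the weight
      # on the prediction error is suppressed...
      after_cue = weight_on_error(prior_precision=4.0, error_precision=1.0)
      # ...while attention increases the expected precision of the error,
      # partly reversing that expectation-induced suppression.
      cue_plus_attention = weight_on_error(prior_precision=4.0,
                                           error_precision=3.0)

      print(after_cue < cue_plus_attention < baseline)  # True
      ```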

      How will this predictive coding mechanism work for feature-detection rather than spatial attention? A cue of some sort (e.g., someone says “watch out for faces”) will increase the probability for the hypothesis that a face is present and will also increase the gain on any face-related bottom-up signal. I don’t see why there should be a bias towards non-face stimuli. It might be hard if there are no precise spatial expectations associated with this hypothesis, such that the receptive field is very large (or the whole page of the Waldo book, as Carolyn suggests). There will in those cases be lots of prediction error generated from other things, and even though face-related stimuli are given an advantage by attention, eventually the prediction error will overcome the prior that there is a face occurring. This is just perceptual learning and is a good thing, since if there really isn’t a face around then the system needs to learn that sustained attention to faces is not going to work.

      I do realize this is not a straight response to Madeleine’s post though!

  5. Before looking at Madeleine and Sina’s comment, I’ll mention a tweet on this post from Phil Corlett, who works on PC and delusions. He refers to the relative weighting (anti-correlation) of evidence and hypotheses and says “on the anti-correlation point, that decreases in psychosis, related to symptoms”, and refers to his nice paper on delusions for discussion: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4471972/

  6. Hi Madeleine!

    I have a quick comment on one of the ideas that comes up in the paper, video, and this comment.

    In the video it says: “The featural hypothesis (red-and-white-striped) needs to be selectively applied to the scene—to the space where these objects are located—but this is unknown prior to searching…

    You can’t turn up the gain on the red and white striped objects until you’ve located the red and white striped objects!!”

    Here is a scenario that you may not have considered: 1) I look for Waldo’s shirt at an unknown location using a feature space that happens to include retinotopic location codes, 2) I detect a candidate for Waldo’s shirt in this feature space and this draws attention to this candidate, amplifying whatever happens to be at that retinotopic location. In this scenario, the thing directing attention does not have access to the location codes, but they come along with the feature information, so one need not search in virtue of the location to get information about the location. In fact, a recent paper by Kanai et al. argues that spatial attention only works when (visual) stimuli are conscious, and that feature-based attention is necessary to modulate unconscious stimuli: http://www.cell.com/current-biology/abstract/S0960-9822(06)02299-8

    Separately, I think the participant does have some idea for where to look in every case that I can think of. In the case you mention, the Where’s Waldo puzzle is in front of you, taking up maybe 20 degrees of visual angle, about an arm’s distance away.

    1. Hi Carolyn,

      We agree with you on your last point that the agent has some idea of where to look in most cases of feature-based attention, but we think this will be underspecified in most cases. Even educated guesses only narrow down the range of potential locations, and the PC theory needs more specificity than this to escape our worry.

      With regards to your suggestion, the first thing to note is that we are also in agreement that spatial and feature-based attention can come apart – in fact, we take this as the datum that the PC theory can’t account for in our criticism of endogenous feature based attention.

      The second thing to note is that again, the success of the suggestion will turn on the details. How does looking for the feature and detection occur? On the PC account this translates into asking what sorts of hypotheses need to be in play.

      But recall that the hypotheses must be mutually exclusive and jointly exhaustive. So if a featural hypothesis isn’t indexed to a location on the PC theory, then it’s not clear that it actually is in competition with other featural hypotheses. Different features can exist at different locations in space without there being a problem, so if a location component isn’t explicitly part of the hypotheses then it’s hard to see how they are in competition with one another. One might think that the solution here would be to index the featural hypothesis to an object, but again this looks like the same problem will arise – does object x have feature y or feature z? Well, if object x isn’t indexed to a location then why shouldn’t there exist two such objects?

      The third and trickiest issue is determining how the feature space can come to include retinotopic location codes in the first place. This seems to imply that we have already done the work of visually processing the scene in front of us and putting everything in its place, and so the bulk of the work has been done already. Here it would be crucial to specify what sorts of hypotheses are responsible for generating this information. Finally, this makes it look like the PC theory is still committed to the claim that locating the red and white objects occurs prior to turning up the gain.

  7. Thanks for an engaging paper, Madeleine and Sina,

    I am not sure I completely understand what the connection between precision expectations and endogenous attention is supposed to be on the PC framework (which is perhaps exactly your point!). Madeleine’s responses to Hohwy and Jennings are very helpful in this regard. I am also inclined to accept Sina’s claim that task relevance is orthogonal to precision expectations; hence solving, e.g., the Waldo case via task-relevance does nothing to help solve it in terms of precision expectations (which makes me even more puzzled as to what the connection between endogenous attention and expected precision is supposed to be, on the PC account).

    In any event, I would like to try to offer a response on behalf of PC to your Waldo case. It probably contains mistakes. If you could spot them it would help me see what I am missing.

    Assume the hypothesis we test against sensory stimuli is “this is Waldo” and it is tested against every object in the scene. We have prior knowledge that in “where is Waldo” puzzles there are a lot of decoys, i.e., red striped objects that are not Waldo. But we have no prior knowledge about the existence of non-red-striped things, such as, e.g., yellow things, in the scene. Compare now looking at a red striped blanket and looking at a yellow blanket. Both generate a prediction error relative to the hypothesis “this is Waldo”. The probability that the red striped blanket stimulus is really a red striped blanket (hence the prediction error is genuine rather than noise), given that there are many red striped things in the scene, is higher (all things being equal) than the probability that the yellow blanket stimulus is really a yellow blanket (rather than noise). So, because we know in advance that there are a lot of red striped decoys, we expect (other things being equal) error signals involving red striped things to be more precise (i.e., less noisy) than error signals involving non-red-striped things. This explains why attention is directed to red striped things (given that expectations of precision drive attention).

    (I have ignored the fact that the error is larger in the yellow blanket case, which apparently implies that the precision is also higher. Perhaps this contribution to precision expectation is negligible relative to the contribution from the aforementioned “decoy” issue.)

    What do you think?

    1. Hi Assaf,

      Thanks for your comment and suggestion! To respond to your first comment, precision expectations will be responsible for the driving or guiding aspect of attention on the PC view, as they will govern which prediction errors get amplified and which get dampened. In this way they are responsible for the selectivity of attention, in conjunction with the constraint that gain must sum to one. The enhancement portion of attention will come from upping the gain on select prediction errors – this will correspond to increased hypothesis revision and minimization of prediction error beyond what would normally occur, so that things are perceived in more detail than they would be otherwise.

      In the suggested solution you describe, we agree that it is most intuitive to think that there will be a high expected precision for prediction error generated by hypotheses concerning all red striped items. When the featural cue is shown, this should function to raise the prior probability of red and white stripes appearing in the scene, and also cause the system to expect color-based prediction error (or prediction error generated by hypotheses involving the color red) to be highly precise. However, it is just such assumptions that, on our view, lead the PC account into problems.

      The suggested solution as we understand it is as follows: we have a hypothesis ‘this is Waldo’ that is applied to the entire scene. This hypothesis will contain a featural component ‘red stripes’, for which any prediction error will be expected to be highly precise. Objects that are red and white striped (but fail to be Waldo) will generate a smaller prediction error than objects that aren’t red and white striped and also aren’t Waldo. But we will stop here, as here lies the problem with this proposal – precision expectations will amplify any prediction error that results from the featural component of the hypothesis ‘red stripes’, and so this should lead to attention to the non-red striped object. It will generate a large prediction error amplified by gain, whereas the red-striped object will generate a small prediction error amplified by gain.

      The problem is really that our only connection with the world is through prediction error – it is the only thing that moves up the hierarchy. Prediction error can only be generated by hypotheses, and so we can’t just turn up the gain on information generated by red striped objects per se. We are not operating with a partial bottom up model that builds up percepts that can then be selectively amplified.
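      A minimal numerical sketch of the worry (illustrative numbers of our own, not anything drawn from Feldman & Friston):

```python
# Sketch of the objection: if high expected precision attaches to the
# featural component 'red stripes' of the 'this is Waldo' hypothesis,
# the gain amplifies whatever mismatch that component generates.

def weighted_error(predicted, observed, gain):
    """Precision-weighted prediction error: gain amplifies the mismatch."""
    return gain * abs(predicted - observed)

gain_red_stripes = 0.9   # high expected precision for the cued feature
prediction = 1.0         # 'red stripes present', coded as 1.0

# Observations: 1.0 = fully red-striped, 0.0 = not red-striped at all
red_striped_blanket = 0.9   # small mismatch with the prediction
yellow_blanket = 0.0        # large mismatch with the prediction

err_red = weighted_error(prediction, red_striped_blanket, gain_red_stripes)
err_yellow = weighted_error(prediction, yellow_blanket, gain_red_stripes)

# If attention goes to the largest precision-weighted error, then:
attended = 'yellow blanket' if err_yellow > err_red else 'red-striped blanket'
print(attended)  # the non-red-striped object wins
```

      Since the gain applies to whatever prediction error the ‘red stripes’ hypothesis generates, the larger mismatch wins, and attention is pulled toward exactly the objects the task needs it pulled away from.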

      Finally, as a parenthetical point, you write that “the error is larger in the yellow blanket case, which apparently implies that the precision is also higher”. This isn’t quite right, as on the PC account one can have large prediction error expected to be imprecise, and vice versa.

      Here the reason the precision is expected to be high for this prediction error is that we have high precision expectations for the hypothesis ‘red striped’, and this hypothesis is being applied to the yellow blanket (indeed, it is being applied to the whole scene). Hope this all helps and that we have understood your example! Thanks for reading and engaging with this debate.

  8. Hello, and thanks to everyone for this interesting discussion.

    I’ve rather a naive question — one which is a genuine request for clarification, and not an objection in disguise. If this were a real-world conference it would be a question for peripheral chat, rather than for the lecture hall.

    Could someone explain why it is that the precision weightings must sum to one? Jakob sometimes says that they must ‘in order to be meaningful’, but I’m not sure what this means. Certainly, if I’m interested in the value of variable V then the probabilities that I assign to the various values that V might take should sum to one. But, on the face of it, nothing follows from that about how precise I should take the several of my ways of gauging V to be. Nor does the fact that one of my gauges is precise entail that some of the others mustn’t be (unless the gauges give different readings). Even if they do give different readings, I don’t see why I couldn’t assign them precisions that add up to less than one. This would entail some degree of giving up on the project of discovering the value of that variable, but sometimes we do give up. Even if they add up to more than one, so that I am necessarily committed to a contradiction, I don’t see any reason to think that it is impossible to be so committed — as in the waterfall illusion, when things look to be moving and also look not to be. (Or is the problem meant to be that precisions of more than one would send my brain into overload, and precisions of less than one would send me to sleep?)

    As the last part of this question suggests, I think there are two things that I don’t understand here: one is something basic about the maths, but the other is something about what the relation is meant to be between the maths and the brain (given that Jakob tells us — on the top of p. 25 of his book — that the maths is supposed to describe what the brain does literally, and not metaphorically).

    Any clarification of this would be very much appreciated.

    Chris

    1. Nothing naive about that question!

      If weights are assigned freely, then prediction error minimization would suffer in the long run, and then the system would not approximate Bayesian inference. It would be a sign that higher level priors have not been properly learned. Variances are difficult to learn so it might take time to learn which weights to assign in which contexts, so there can be poor inference along the way where weights are not optimally set. This should give rise to precision prediction error, and lead to precision learning.

      I guess another way of expressing some of this is that if the gain on all sensory input is increased then the system is no better off than if nothing had its gain increased in the first place. Increasing the gain from 1 to 10 for all units doesn’t benefit inference (i.e., it will fail to minimize prediction error).
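      A toy illustration of this point (my own example, not anything from the PC models themselves): if gains function as relative weights in inference, then multiplying every gain by ten leaves inference exactly where it was, which is one way of seeing why only normalized weightings are meaningful.

```python
# Toy inference: combine two noisy cues by their relative gain.
# Only the *ratio* of gains matters once they are normalized to sum to one.

def weighted_estimate(cues, gains):
    """Precision-weighted average of the cues, with normalized gains."""
    total = sum(gains)
    weights = [g / total for g in gains]   # normalization: weights sum to one
    return sum(w * c for w, c in zip(weights, cues))

cues = [2.0, 6.0]
gains = [1.0, 3.0]                 # the second cue is trusted 3x as much
scaled = [g * 10 for g in gains]   # increase every gain tenfold

print(weighted_estimate(cues, gains))    # 5.0
print(weighted_estimate(cues, scaled))   # 5.0 — identical estimate
```

      Uniform scaling changes nothing about which signals dominate, so it does nothing to improve prediction error minimization; selectivity requires that raising some weights lowers others.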

      1. That’s helpful, Jakob. Thanks.

        One of the things that I find it hard to be clear about is the interplay between active and perceptual inferences in the minimization of prediction-errors for estimations of precision (and more so for estimates of higher-order precisions, insofar as there are such things). I take it that the voluntary direction of attention plays the role of active inference here, in that such attention can force a perceptual signal to become more precise by tightening up the neural tuning curves.

        That sort of attention seems to operate at a fairly short time scale, since you can go from attending over here to attending over there in about a quarter of a second. Whereas the perceptual estimation of a signal’s precision might need to sample quite a long time-run of data. That might force us to favour active inference in the short term (in ways that can’t help but be sub-optimal).

      2. Hi Jakob and Chris,

        The question about how gain operates on the PC view is a good one, as there are different ways to understand how this works and it is unclear which version the PC theory is or should be committed to.

        Given that there can be gain on top down hypotheses (where this gain does not form part of attention on the PC view), then it is natural to wonder whether this gain is included in the calculation or not. If it is, then there is the potential for no attention to occur at all when the gain for the top down signals is weighted strongly enough. This seems to be the picture Jakob has in mind when he points out that such a scenario may account for phenomena such as mindwandering that involve the Default Mode Network, as the DMN is thought to be anti-correlated with attentional processes (though against this view of mindwandering see Christoff 2012 “Undirected thought: Neural determinants and correlates” Brain Research 1428: 51-59).

        If this is the view, then one might additionally wonder whether the top down gain operates on competing top down hypotheses, or merely on the winning hypotheses that determine the contents of perception (or perhaps thought). In the first case, competing hypotheses would be included in the sum, in the second only the winning hypotheses are included.

        Regardless of whether top down gain enters into the calculation, one might also ask whether gain modulates the prediction error signals generated by competing hypotheses, and how the summation process is supposed to work here. If there are multiple hypotheses competing to minimize prediction error for a particular object/region of space, does the gain for these hypotheses have to sum to one, or does the summation take place across all competing hypotheses for all objects/spatial locations in the scene?

        Finally, putting aside the question of attention to thought, there is the issue of gain on interoceptive predictions and gain on interoceptive prediction errors (which will form part of attention). The same questions posed above will apply here, in addition to the question of whether the gain adding up to one includes the gain on these prediction errors. If the gain on interoceptive prediction errors is included in the summation, then one might see such gain as serving to direct attention to thought or perhaps to certain items in one’s visual environment.

        Understanding which picture of gain is or should be operant on the PC model is crucial to understanding whether the PC treatment of attention is adequate.
