Inference and Error in Comparative Psychology: The Case of Mindreading

Marta Halina, University of Cambridge

[PDF of Marta Halina’s paper]

[Jump to Irina Mikhalevich’s commentary]
[Jump to Robert Lurz’s commentary]
[Jump to Kristin Andrews’s commentary]


Mindreading is the ability to attribute mental states to other agents. Over the last decade, there has been a wealth of experimental work on the question of whether nonhuman animals mindread. The positive results of these experiments have led many comparative psychologists to conclude that animals attribute some mental states, such as intentions and perceptions, to others. Sceptics remain, however. They argue that one can provide alternative non-mindreading explanations for the positive results of mindreading experiments, and that insofar as this can be done, the hypothesis that animals mindread lacks evidential support. In this paper, I argue that this “alternative-hypothesis objection” depends on an oversimplified view of the relationship between theory and evidence. A more nuanced account reveals that the mindreading hypothesis is supported by the data produced by mindreading experiments, while alternative hypotheses, such as behaviour reading and submentalizing, lack such support. I conclude by considering whether these alternative hypotheses undermine the evidence for mindreading by serving as experimental confounds. I argue that their ability to do so depends on their independent evidential support and that mindreading sceptics have not done enough to show that they have such support.[1]

1. Introduction

Mindreading is the ability to attribute mental states to other agents. It is what we do when we predict and explain the behaviour of others by appealing to their beliefs, desires, intentions, and perceptions, rather than just their observable behaviour. Mindreading is thought to be ubiquitous in adult human life and to underlie many other cognitive abilities, such as empathy, self-awareness, and even phenomenal consciousness (Baron-Cohen 1997; Carruthers 2009; Apperly 2011). Many of these abilities have long been held to be uniquely human. Discovering whether nonhuman animals mindread, then, would dramatically affect not only how we view them, but also how we view ourselves.

Psychologists and philosophers have been pursuing the question of whether nonhuman animals (hereafter, animals) mindread for over 35 years (Premack and Woodruff 1978). From the beginning of this research program, there has been a debate over how to interpret the positive results of mindreading experiments. On the one hand are those psychologists and philosophers who take these positive results as good evidence for animal mindreading (Call and Tomasello 2008; Fletcher and Carruthers 2013; Halina 2015; Clayton 2015); on the other are those who do not (Povinelli and Vonk 2006; Penn et al. 2008; Penn and Povinelli 2007, 2009, 2013; Penn 2011; Lurz 2011; Heyes 2014, 2015; Buckner 2013). I refer to the latter group as methodological sceptics (or “sceptics”) given their doubt in the adequacy of the methods currently used to test for mindreading in animals.[2]

The “alternative-hypothesis objection” is central to the critique advanced by the sceptics.[3] This objection holds that if an alternative non-mindreading hypothesis can account for the results of a mindreading experiment, then those results do not in fact provide good evidence for mindreading. In such cases, both the mindreading and alternative hypotheses are equally well supported by the data. The sceptics often go on to conclude that we should accept one of the alternative hypotheses on the grounds that it is simpler than positing mindreading. The focus of this paper is on the first two claims, however.[4] In particular, I address the question, “does the ability to account for the results of a mindreading experiment with an alternative hypothesis (1) undermine the evidential support for the hypothesis being tested and/or (2) provide good evidence for the alternative?” My answer to both parts of this question is “no.”

Philosophers of science have been thinking about the relationship between theory and evidence for a long time, and in particular, about when data should count as good evidence for a hypothesis. Few of those engaged in the mindreading debate, however, have drawn on general philosophy of science in order to better understand the alternative-hypothesis objection. I do this here by evaluating this objection within the framework of experimental testing and inference developed by error-statistical philosophers like Deborah Mayo (Mayo 1996; Mayo and Spanos 2008; Staley 2008). I argue that the alternative-hypothesis objection depends on an oversimplified view of the relationship between theory and data. This view holds that it is sufficient for data to “fit” or be consistent with a hypothesis in order for it to serve as good evidence for that hypothesis. I instead argue that in order for data to serve as good evidence for a hypothesis, it must be produced by a “severe test” or a testing procedure capable of detecting whether that hypothesis is false when it is in fact false. With this additional criterion for good evidence in place, I show that mindreading experiments constitute severe tests with respect to the mindreading hypothesis being tested, but not with respect to the alternative hypotheses of behaviour reading (Penn and Povinelli 2007) and submentalizing (Heyes 2014, 2015). Thus, although the positive results of mindreading experiments are consistent with both mindreading and these alternatives, they provide good evidence only for the former.

I begin in section 2 by introducing the logic behind mindreading experiments and why comparative psychologists take them to provide good evidence for mindreading. In section 3, I introduce the alternative-hypothesis objection. I then show in section 4 how this objection depends on an overly permissive account of good evidence and introduce severe testing as a corrective for this view. In section 5, I argue that mindreading experiments are severe tests with respect to the mindreading hypothesis, but not with respect to the alternatives of behaviour reading and submentalizing. I conclude in section 6 by considering one way in which the alternative-hypothesis objection might succeed in weakening the evidence for mindreading: by flagging experimental confounds. In order to do this, however, sceptics must show that these purported confounds have independent empirical support—something that they do not currently do.

2. Mindreading Research

How do comparative psychologists test whether animals attribute mental states to other agents? Although the particular methods vary depending on the mental state in question, the general approach used in mindreading experiments is the same.[5] Indeed, the logic follows that of experimental design in psychology more generally.[6] In order to see this, it is useful to first distinguish between two kinds of hypotheses: high-level and experimental.[7] In psychology, the high-level hypothesis is typically the cognitive account or mechanism that researchers are testing through the implementation of a battery of experiments and observational studies. Examples include the claims that chimpanzees understand intentions or that they have level 1 visual perspective taking abilities. Such high-level hypotheses give rise to a range of concrete predictions. It is these predictions that serve as the basis for experimental hypotheses, where an experimental hypothesis is a claim that some factor will vary with another factor (what will become the dependent and independent variables in an experiment). Such hypotheses include the claim that subjects will prefer to steal food from a competitor by reaching through an opaque tunnel over a transparent one (Melis et al. 2006) or request food from a cooperative agent by gesturing towards their front rather than their back (Liebal et al. 2004).

Experimental hypotheses are then tested by means of a well-designed experiment, that is, one that is internally valid and set up to test the relationship between the two variables in question. The independent variable is the variable manipulated by researchers across conditions (such as the opaqueness of a barrier), whereas the dependent variable is that which researchers predict will be affected by the independent variable in a particular way (such as a subject’s attempt to retrieve food). Ensuring internal validity requires that researchers control nuisance variables: factors other than the independent variable that might affect the dependent variable. Nuisance variables need to be eliminated or randomized in order to prevent them from systematically affecting the dependent variable. A nuisance variable that has such a systematic effect may either produce a difference across conditions that researchers mistakenly attribute to the independent variable or produce an effect counter to the independent variable, thereby masking the latter’s impact. When nuisance variables are randomized, their effects are taken into account in the statistical analysis of the data. If the results of the experiment are statistically significant and match the prediction, then this is taken as positive evidence for the experimental hypothesis under test.

When a large number of experimental hypotheses are confirmed, this is taken as evidence in favour of the high-level hypothesis predicting them.[8] Generally, the greater the number and variety of confirmed experimental hypotheses, the more confident researchers are in the truth of the high-level hypothesis. It is for these reasons that comparative psychologists, such as Nicola Clayton, Michael Tomasello, and Josep Call, hold that animals are capable of some forms of mindreading. The predictions made by the high-level hypothesis that chimpanzees are capable of level 1 visual perspective taking, for example, have led to many positive results. These results support the high-level hypothesis, according to proponents of the current experimental approach. Let us now turn to the objection posed by the sceptics.

3. The Alternative-hypothesis Objection

The methodological sceptics argue that the above experiments do not in fact provide good evidence for mindreading because there are alternative non-mindreading hypotheses that can account for the experimental results. Over the last decade, Povinelli and colleagues have argued that these results can be explained by a “behaviour-reading hypothesis” (Povinelli and Vonk 2006; Penn et al. 2008; Penn and Povinelli 2007, 2009, 2013; Penn 2011; see also Lurz 2011 and Buckner 2013); while, recently, Cecilia Heyes has argued that they can be accommodated by a “submentalizing hypothesis” (Heyes 2014, 2015). In both cases, the sceptics hold that insofar as the results of mindreading experiments can be explained by these alternative accounts, they do not provide good evidence for mindreading. Instead, because the results are consistent with both mindreading and these alternatives, and the alternatives can be viewed as simpler than mindreading, it is these alternatives that researchers should accept.

Penn and Povinelli (2007) represent a clear example of this argument. They write that, “in order to produce experimental evidence for an fToM [theory of mind function], one must first falsify the null hypothesis that the agents in question are simply using their normal, first-person cognitive state variables” (734). They then show how one can construct an alternative non-mindreading explanation for every positive result produced by mindreading experiments. Constructing such an explanation involves more or less positing that subjects have a rule that links whatever observable variable the experimenter is manipulating with whatever dependent variable is being measured. The claim that subjects rely on such a collection of rules is what constitutes the behaviour-reading hypothesis. Given that the results of mindreading experiments are consistent with this alternative, Penn and Povinelli conclude, “the available evidence suggests that chimpanzees, corvids and all other non-human animals only form representations and reason about observable features, relations and states of affairs from their own cognitive perspective” (737).

Heyes’s general strategy is similar to that of Povinelli and colleagues. She writes, “all of the results published in recent years are subject to the observables problem; they could be due to mindreading, but they are at least equally likely to reflect exclusive use for social decision-making of directly observable features of the stimulus context” (Heyes 2015, 316). Heyes departs from Povinelli and colleagues, however, in advancing submentalizing as the alternative hypothesis of choice. According to this hypothesis, subjects solve mindreading tasks by employing “domain-general cognitive processes that do not involve thinking about mental states but can produce in social contexts behavior that looks as if it is controlled by thinking about mental states” (Heyes 2014, 132). Insofar as domain-general processes such as memory, attention, and perception can account for the positive results of mindreading experiments, Heyes argues, they do not provide evidence for mindreading. Heyes then shows how the best mindreading experiments can be reinterpreted in this way.

4. An Additional Criterion for Good Evidence: Severe Tests

The problem of being able to accommodate a set of data by multiple theories or hypotheses is well known in philosophy of science. Deborah Mayo refers to this as the “alternative-hypothesis objection” or “methodological underdeterminism.” She characterizes the general objection as follows: “Evidence in accordance with hypothesis H cannot really count in favour of H, it is objected, if it counts equally well for any number of (perhaps infinitely many) other hypotheses” (1996, 174).

Many philosophers now agree, however, that this objection rests on an oversimplified view of the relationship between theory and evidence: namely, the view that in order for data to serve as evidence for a hypothesis, it need only be consistent with that hypothesis. The problem with this view can be illustrated with a simple example. Imagine that you would like to test the claim (H1) that running a marathon will not lead some person x to lose more than five kilograms of weight. To test this, you weigh x before and after the marathon, finding that the scale indicates 50 kilograms both times. The data fits your hypothesis and seems like good evidence for it. Now imagine that you want to test the claim (H2) that running a marathon will not lead x to lose more than half a kilogram of weight. You conduct the same test and find again that the scale reads 50 kilograms both times. This data also fits your hypothesis and seems like good evidence for it. That is, until you discover that the scale that you have been weighing x on is sensitive only to changes of weight of one kilogram or more. The collected data still fits H1 and H2, but most would contend that you only have evidence for H1. This is because if H2 were false, if x lost more than half a kilogram of weight after the marathon (600, 700, 800, or 835 grams, for example), this particular scale would be unable to detect it. In contrast, if H1 were false, if x’s weight had changed by more than five kilograms, your measuring instrument would have indicated this by producing data that is discordant with H1.
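The weighing-scale example can be sketched in a few lines of code. The rounding model, function names, and numbers below are my own illustrative choices, intended only to mirror the structure of the example (a scale insensitive to sub-kilogram changes):

```python
def scale_reading(true_weight_kg: float) -> int:
    """An idealized scale that registers only whole-kilogram changes:
    it reports weight rounded to the nearest kilogram."""
    return round(true_weight_kg)

def detects_falsity(before_kg: float, after_kg: float,
                    threshold_kg: float) -> bool:
    """Does the scale produce data discordant with the hypothesis
    'the weight change does not exceed threshold_kg'?"""
    observed_change = abs(scale_reading(before_kg) - scale_reading(after_kg))
    return observed_change > threshold_kg

# x actually loses 0.7 kg, so H2 (no loss over 0.5 kg) is false,
# but the scale reads 50 both times: the falsity goes undetected.
print(detects_falsity(50.3, 49.6, 0.5))   # prints False

# x actually loses 6 kg, so H1 (no loss over 5 kg) is false,
# and here the scale does register the discordant change.
print(detects_falsity(50.3, 44.3, 5.0))   # prints True
```

The same data, identical readings of 50 kilograms, fits both hypotheses; what differs is the test procedure’s capacity to register discordance, which is precisely the severity point.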

This example captures the idea that producing data that fits one’s hypothesis is not enough for it to serve as good evidence for that hypothesis. Instead one must take the test procedure used to produce that data into account and ensure that it is capable of producing data that is discordant with that hypothesis when that hypothesis is in fact false. Such test procedures are what Mayo (1996) calls “severe” or “error probing.” Recognizing this additional constraint on good evidence dulls the threat of methodological underdetermination. The ability to conceive of alternative hypotheses consistent with one’s data is not enough to show that these alternatives are serious rivals with evidential support. One must also show that the tests responsible for producing that data are severe with respect to those alternatives.[9]

5. Mindreading and its Alternatives: Are They Severely Tested?

Severe tests provide an additional constraint for evaluating evidence for a hypothesis. In this section I apply this general lesson to the alternative-hypothesis objection advanced by mindreading sceptics. I argue that mindreading experiments are severe with respect to the mindreading hypothesis being tested, but not with respect to the alternative behaviour-reading and submentalizing accounts. If the behaviour-reading and submentalizing hypotheses are to be considered genuine rivals to mindreading, we must look elsewhere for their empirical support.

5.1. Mindreading Hypotheses

As we saw in section 2, a mindreading hypothesis is a high-level hypothesis that makes predictions, which serve as the basis for experimental hypotheses. In order to determine whether a high-level mindreading hypothesis has been severely tested, we must look at both stages of this process.

Let us begin with experimental hypotheses. An experimental hypothesis has been severely tested when an experiment is internally valid and the results are statistically significant. Recall that an experimental hypothesis in a mindreading experiment takes the form of a claim that there will be a difference in the dependent variable between two conditions—one in which the independent variable is present and one in which it is absent. Data that is discordant with this hypothesis would take the form of no observed difference between conditions. Now in order to determine the severity of this test, we must ask, how likely is it that it would produce such discordant data if the experimental hypothesis were false? If researchers follow standard protocol for experimental design and statistical analysis, then this likelihood is high.

As we saw, standard experimental protocol dictates that researchers control for nuisance variables by holding them constant or randomly allocating them across conditions. Successfully doing this means that the only systematic change in the dependent variable—if there is one at all—will likely be due to the independent variable. Randomized nuisance variables may accidentally produce systematic effects by piling up in one condition or another, but this possibility is taken into account in the statistical analysis. Within-condition variance of the dependent variable is used to gauge how much noise is being produced by randomized nuisance variables.[10] This variance is used to determine the threshold or significance level at which one should accept the experimental hypothesis. The more variance there is, the bigger the difference required between conditions in order to conclude that the effect is due to the independent variable, rather than to the randomized nuisance variables.

In controlling for nuisance variables in this way, researchers are maximizing the probability that the data will be discordant with the experimental hypothesis, if that hypothesis were false. They are creating a situation in which, if the independent variable had no effect on the dependent variable, it would be unlikely that the data leading to the acceptance of this claim (i.e., a statistically significant difference) would be produced. Formally, for a given statistical test, this probability is 1 − α, where α is the probability of falsely accepting the experimental hypothesis (that is, accepting that the independent variable produced the observed effect when in fact it did not). In psychology experiments, α is typically set to less than 5%, which means that the severity of these tests, or the probability of rejecting the experimental hypothesis when it is false, is greater than 95%. In other words, such tests are almost maximally severe.
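This severity figure can be checked by simulation. The sketch below is my own illustration, using only the Python standard library; the sample sizes, permutation count, and significance level are arbitrary choices. It repeatedly runs a mock experiment in which the independent variable has no effect, so the experimental hypothesis is false by construction, and counts how often a permutation test nonetheless declares a significant difference:

```python
import random
import statistics

random.seed(0)  # reproducibility of the illustration

def mean_diff(a, b):
    return statistics.mean(a) - statistics.mean(b)

def permutation_p_value(a, b, n_perm=500):
    """Two-sided permutation test on the difference of condition means."""
    observed = abs(mean_diff(a, b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(mean_diff(pooled[:len(a)], pooled[len(a):])) >= observed:
            extreme += 1
    return extreme / n_perm

# Simulate experiments in which the experimental hypothesis is FALSE:
# the independent variable has no effect, and both conditions reflect
# only randomized nuisance noise.
alpha = 0.05
n_experiments = 200
false_acceptances = 0
for _ in range(n_experiments):
    cond_a = [random.gauss(0, 1) for _ in range(15)]
    cond_b = [random.gauss(0, 1) for _ in range(15)]
    if permutation_p_value(cond_a, cond_b) < alpha:
        false_acceptances += 1

# Only roughly 5% of these no-effect experiments come out "significant";
# the other ~95% produce discordant (non-significant) data, which is
# the test's severity with respect to the experimental hypothesis.
print(false_acceptances / n_experiments)
```

Because the hypothesis is false in every simulated run, the low rate of “significant” outcomes shows how such a test would, with high probability, produce discordant data when the experimental hypothesis is false.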

The above concerns experimental hypotheses, but how does this affect the evaluation of a high-level mindreading hypothesis? The high-level hypothesis is the source of the claim that a relationship will be found between the independent and dependent variables being tested. By making such a prediction, and subjecting it to testing, the high-level hypothesis is putting itself at risk. Each experiment is an opportunity for discordant data to be produced in the form of a negative result. The more predictions that a high-level hypothesis makes and tests, the more likely it is that discordant data will be produced, if that hypothesis is false. In this way, high-level mindreading hypotheses, such as the claim that chimpanzees have level 1 visual perspective taking abilities, have been severely tested. It is unlikely that all of those results would align themselves in precisely the way that the mindreading hypothesis predicts if the mindreading hypothesis were false.[11]
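A rough back-of-the-envelope calculation illustrates how quickly accumulated predictions add severity. Assuming, purely for illustration, that each experiment independently has at most a 5% chance of producing a spurious positive result when the high-level hypothesis is false (real experiments are not strictly independent, so this is an idealization), the chance that every one of k experiments comes out positive anyway shrinks geometrically:

```python
# Illustrative arithmetic only: alpha is an assumed per-experiment
# chance of a spurious positive, and independence across experiments
# is assumed.
alpha = 0.05
for k in (1, 3, 5, 10):
    chance_all_spurious = alpha ** k
    print(f"{k} experiments: {chance_all_spurious:.3g}")
```

Even under this crude model, ten uniformly positive results would be an extraordinary coincidence if the high-level hypothesis were false.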

Of course, although the production of all of this concordant data is unlikely if the mindreading hypothesis were false, it is still possible. Perhaps the mindreading hypothesis accurately predicts what other animals will do in a variety of social situations, even though the hypothesis itself is wrong. Perhaps the world is organized in such a way that subjects behave as we expect mindreading agents to behave, but for reasons that have nothing to do with mindreading. These are possibilities, but they are unlikely. The number of predictions made by the mindreading hypothesis, the number of experiments run to test these predictions, how carefully controlled these experiments are—all of these things add to the severity by which a given mindreading hypothesis has been tested. When they are present, the evidence for that hypothesis is good.

5.2. The Alternative Hypotheses

Let us now turn to the alternative hypotheses of behaviour reading and submentalizing. These hypotheses fit the data produced by mindreading experiments, but are the data good evidence for them? Starting with experimental hypotheses, one might be tempted to run the same analysis as above, but an immediate problem arises. The severity by which a particular experimental hypothesis is tested matters in the mindreading case because a negative result counts as discordant data with respect to the high-level mindreading hypothesis. However, this is not the case for the alternative hypotheses of behaviour reading and submentalizing. For these alternatives, both the positive and negative results of mindreading experiments are consistent with their accounts. Indeed, proponents of these alternatives count such negative results as evidence for their views (Penn and Povinelli 2007; Heyes 2015).

That both the positive and negative results of mindreading experiments are consistent with behaviour reading and submentalizing indicates that these experiments are not severe tests with respect to these hypotheses. The fact that these experiments are internally valid and produce statistically significant results is irrelevant from the perspective of these alternatives because a negative result would not count against them. If these alternatives were false, mindreading experiments would not be able to detect it.

The fact that mindreading experiments are not severe tests with respect to behaviour reading and submentalizing is not surprising when one recognizes that these experiments were not designed to test these hypotheses. The submentalizing hypothesis, for instance, makes claims about how certain environmental objects affect an organism’s perception and memory. Heyes (2015), for example, reinterprets the positive results of a visual perspective taking experiment on ravens (Bugnyar 2011) by positing that the memory of subjects was cued in one condition by the presence of a competitor, but not cued in a second condition because the competitor appeared in a different context. This is an interesting claim, but requires an alternative experiment in order to test it—one designed to do so. Such an experiment would require choosing the right independent and dependent variables (ensuring construct validity) and controlling for nuisance variables (ensuring internal validity). For example, given the evidence in favour of visual perspective taking in corvids, one would certainly not want the visual access of a competitor to a food-hiding event to vary across conditions, as it does in this study. And the dependent variable should be a clear indicator of the disruption of “what” and “where” short-term memory. Perhaps a change detection task would be more appropriate than the choice of re-caching one of several food items in the presence of a competitor (Leising et al. 2013). I take it to be widely accepted among experimental psychologists that a well-designed experiment requires stating the hypothesis in advance and designing an experiment to test it rather than some other hypothesis. The claim that the data produced by mindreading experiments do not serve as evidence for behaviour reading or submentalizing is an extension of this point. To produce evidence for these hypotheses, one must conduct experiments designed to do so.

If mindreading experiments were not designed to test behaviour reading and submentalizing, why do they produce data that fits these hypotheses so well? Does this fit not serve as some indicator that these hypotheses reflect the way the world actually is? Here the concept of severe testing is again helpful for diagnosing the situation. Recall that a strength of the high-level mindreading hypothesis is that its predictions have survived severe tests. In contrast, the behaviour-reading and submentalizing hypotheses accommodate data as they are produced (Fletcher and Carruthers 2013). Hypotheses that are constructed on the basis of known data are referred to as “use-constructed” or “rigged” (Mayo 1996). Not all such rigged hypotheses are problematic, but they are problematic when their method of construction minimizes the chances of their being identified as false when they are in fact false. Consider, as an example of this, Mayo’s Texas sharpshooter:

Having shot several holes into a board, the shooter then draws a target on the board so that he has scored several bull’s-eyes. The hypothesis, H, that he is a good shot, fits the data, but this procedure would very probably yield so good a fit, even if H is false (1996, 201).

Such a hypothesis is not only use-constructed, but constructed in such a way that it could not have failed to fit the data, even if it were false. If the behaviour-reading and submentalizing hypotheses are rigged in this way, then their fit with the data says little to nothing about whether they are likely to be true. Thus, we must ask, if these hypotheses were false, what kind of data would indicate this and do we have the means for producing it? Neither the behaviour-reading nor the submentalizing hypothesis fares well in this regard. The reason is that they are both so vaguely specified and flexible that it is not clear whether it is possible to produce data that is inconsistent with them (Fletcher and Carruthers 2013; Halina 2015). Those who hold that it is possible to produce such data (Penn and Povinelli 2007; Heyes 2015) have not yet constructed the means for doing so. Given this, the current fit that these hypotheses have with the available data is best attributed not to their predictive and empirical success, but to the fact that they are so underconstrained that they can accommodate all of the data that comes their way. Proponents of these views have not done what is required in order to meaningfully test them—that is, test them in ways that are capable of detecting whether they are false. In this case, I would say, with John Worrall, that “the ‘success’ of the theory clearly tells us nothing about the theory’s fit with Nature, but only about its adaptability and the ingenuity of its proponents” (1989, 155, emphasis original; quoted in Mayo 1996).

6. Alternative Explanations as Experimental Confounds

Mindreading experiments do not provide evidence for the alternative hypotheses of behaviour reading and submentalizing. Why then do sceptics take these alternatives as undermining the evidence in support of particular mindreading hypotheses? Another possible reason for this is that sceptics take them as threatening the internal validity of mindreading experiments in the form of experimental confounds. Typically, experimental confounds take the form of nuisance variables that are eliminated or randomized, as discussed above. However, sceptics tend to hold that it is precisely the independent variable that is confounding the results of a given mindreading experiment. That is, they often do not dispute that the experiment has established a relationship between the independent and dependent variables, but rather hold that the mindreading hypothesis is not the best explanation for this relationship, because we have good reason to think that the relationship would hold for reasons unrelated to mindreading. This is a legitimate strategy. If we have good reason to expect a relationship between two variables to obtain independently of the mindreading hypothesis being true, then testing for this relationship is not a good way of testing for mindreading. The question then is whether the behaviour-reading or submentalizing hypotheses constitute good reasons for thinking that the relationships discovered in mindreading experiments would have occurred independently of mindreading.

This is a difficult question because it hinges on what we mean by “good reason.” As we saw above, the behaviour-reading and submentalizing hypotheses cannot draw on the results of mindreading experiments for empirical support. The evidence for their plausibility has to come from elsewhere. But exactly how much support or plausibility is needed in order for something to be considered a legitimate experimental confound has not been discussed widely in the literature. Good experimentalists try to control for all factors that might affect the dependent variable, regardless of whether concrete evidence for a particular factor having such an effect has been provided. The psychologists who have been conducting mindreading experiments are no different in this respect. However, these psychologists are right to become wary when a research program is criticised through the positing of a wide range of experimental confounds that were not considered plausible until an experiment produced positive results. When this happens, one must ask what the difference is between constructively flagging experimental confounds and presenting a sceptical foil aimed at undermining evidence for a hypothesis for the sole sake of undermining it. The latter strategy is undesirable, not least because it violates what Staley (2008) identifies as Peirce’s rule: to not “block the way of inquiry” (403). Staley writes that, “to use the mere possibility of error, in the absence of any real doubt, as an obstacle to accepting the result of a sound probable inference, would be to violate Peirce’s rule” (403).

How then do we identify a legitimate confound? I propose that in order for a purported confound to be considered legitimate, it should at least have some independent empirical support—and the more support it has, the more seriously it should be taken. In psychology, it is simply too easy to come up with alternative cognitive mechanisms as purported confounds. If we allowed every such alternative to be taken seriously by experimentalists, regardless of their support, research would be unable to proceed.

If we understand the behaviour-reading and submentalizing hypotheses as collections of proposed experimental confounds, then these confounds currently vary in their empirical plausibility. With respect to the behaviour-reading hypothesis, early proposals had empirical support. Consider, for example, the critique of Hare et al. (2000) that subordinate subjects might have avoided the food that the dominant competitor saw because the competitor was allowed to approach that food, and so might simply have scared the subordinate away from it (D’Arcy and Povinelli 2002). This critique was empirically plausible: behavioural responses of subordinate chimpanzees to dominants were well known at the time (cite best observational studies). However, over the last ten years, as more positive results on the visual perspective taking abilities of chimpanzees and other animals have come in, and researchers have improved their experimental designs, the purported confounds cited by sceptics have become less plausible. If the confounds cited are no more than an abstract set of innate behavioural rules that have no independent empirical support, then comparative psychologists should not take them seriously.

Heyes’s submentalizing hypothesis is advanced as an improvement over the behaviour-reading hypothesis in precisely this respect (Heyes 2015). She argues that a significant weakness of the behaviour-reading account is its lack of empirical support, writing, “the vast majority of behaviour rules considered in current research on mindreading are based on common sense categories… and are not supported or constrained by empirical evidence of any sort” (321). She proposes the submentalizing hypothesis as “a better conception of ‘not mindreading’” that is “less dependent on common sense than the current conception of behaviour reading” (322). This moves us in the right direction of focusing only on those experimental confounds that have empirical support. However, the crucial point here is not a move away from “common sense”, but a move towards empirically informed alternatives. There is nothing inherently wrong with a hypothesis originating in common sense, as long as that hypothesis has been tested. There are many examples of successful hypotheses with non-scientific origins, such as Kekulé’s famous dream-inspired discovery of the structure of benzene. The problem with the behaviour-reading hypothesis is instead that it is currently a collection of untested conjectures.

This aside, the submentalizing hypothesis fares better than behaviour reading because it draws on the known domain-general cognitive abilities of organisms, such as memory, perception, and attention. Even here, though, Heyes does little to show that her proposed confounds are empirically plausible, rather than simply conceivable. For example, she argues that it is “possible” that the introduction of an opaque barrier prevents a competitor’s presence from cuing retrieval from memory of the location of hidden food in Bugnyar’s visual perspective-taking experiment on ravens, but cites no studies showing that the introduction of such objects typically has this effect on this subject group. Before requiring that comparative psychologists undertake the laborious and expensive task of rerunning experiments with additional control conditions, sceptics should make a good case for the legitimacy of their purported confounds.

To summarize, the burden should not fall on those psychologists conducting mindreading experiments to show that they can eliminate all non-mindreading hypotheses consistent with their data. This is not possible and would needlessly block research. The burden instead falls on sceptics to show that their purported confounds are not merely sceptical foils. To do so, they must show that these confounds legitimately threaten the internal validity of a particular mindreading experiment by providing independent empirical evidence for their likely presence in the experimental context in question. Pointing to the fact that these alternative explanations fit the data produced by mindreading experiments does not constitute such evidence.

7. Conclusion

Proponents of behaviour reading and submentalizing tend to characterize these accounts as hypotheses that are equally well supported by the data produced by mindreading experiments. This rests on a misconception of what constitutes evidential support for a hypothesis: fit is not enough. The tests producing those data must also be severe, that is, capable of detecting that the hypothesis is false when it is in fact false. Mindreading experiments are severe with respect to the high-level mindreading hypothesis being tested, but not with respect to behaviour reading and submentalizing. In order to provide experimental evidence for the latter, one must test them with experiments that were designed to do so. Such independent evidence is also required if these alternatives are to be taken as legitimate confounds in mindreading experiments.




Andrews, Kristin (2012). Do apes read minds? Toward a new folk psychology. Cambridge, MA: MIT Press.

Apperly, Ian (2011). Mindreaders: The cognitive basis of “theory of mind.” New York, NY: Psychology Press.

Baron-Cohen, Simon (1997). Mindblindness: An essay on autism and theory of mind. Cambridge, MA: MIT Press.

Buckner, Cameron (2013). The semantic problem(s) with research on animal mindreading. Mind & Language, 29(5): 566-589.

Bugnyar, Thomas (2011). Knower–guesser differentiation in ravens: Others’ viewpoints matter. Proceedings of the Royal Society B: Biological Sciences, 278(1705): 634-640.

Call, Josep, and Michael Tomasello (2008). Does the Chimpanzee Have a Theory of Mind? 30 Years Later. Trends in Cognitive Sciences, 12(5): 187-192.

Carruthers, Peter (2009). How we know our own minds: the relationship between mindreading and metacognition. Behavioral and Brain Sciences, 32(2): 121-138.

Clayton, Nicola (2015). Ways of thinking: From crows to children and back again. The Quarterly Journal of Experimental Psychology, 68(2): 209-241.

D’Arcy, Karen R. M., and Daniel J. Povinelli (2002). Do chimpanzees know what each other see? A closer look. International Journal of Comparative Psychology, 15(1): 21-54.

Dennett, Daniel C. (1983). Intentional systems in cognitive ethology: The “Panglossian paradigm” defended. The Behavioral and Brain Sciences, 6: 343-390.

Dienes, Zoltán (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. New York: Palgrave Macmillan.

Fletcher, Logan, and Peter Carruthers (2013). Behavior-Reading versus Mentalizing in Animals. In Agency and Joint Attention, ed. Janet Metcalfe and Herbert S. Terrace, 82–99. Oxford: Oxford University Press.

Halina, Marta (2015). There is no special problem of mindreading in nonhuman animals. Philosophy of Science, 82: 473-490.

Hare, Brian, Josep Call, Bryan Agnetta, and Michael Tomasello (2000). Chimpanzees know what conspecifics do and do not see. Animal Behaviour, 59: 771-785.

Heyes, Cecilia M. (2014). Submentalizing: I’m not really reading your mind. Perspectives on Psychological Science, 9: 131-143.

Heyes, Cecilia M. (2015). Animal mindreading: What’s the problem? Psychonomic Bulletin and Review, 22(2): 313-327.

Leising, Kenneth J., L. Caitlin Elmore, Jacquelyne J. Rivera, John F. Magnotti, Jeffrey S. Katz, and Anthony A. Wright (2013). Testing visual short-term memory of pigeons (Columba livia) and a rhesus monkey (Macaca mulatta) with a location change detection task. Animal Cognition, 16: 839-844.

Liebal, Katja, Simone Pika, Josep Call, and Michael Tomasello (2004). To Move or Not to Move: How Apes Adjust to the Attentional State of Others. Interaction Studies, 5(2): 199–219.

Lurz, Robert (2011). Mindreading Animals: The Debate over What Animals Know about Other Minds. Cambridge, MA: MIT Press.

Mayo, Deborah G. (1996). Error and the growth of experimental knowledge. Chicago: The University of Chicago Press.

Mayo, Deborah G., & Spanos, Aris (Eds.) (2008). Error and inference: Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. New York, NY: Cambridge University Press.

Meketa, Irina (2014). A critique of the principle of cognitive simplicity in comparative psychology. Biology & Philosophy, 29: 731-745.

Melis, Alicia P., Josep Call, and Michael Tomasello (2006). Chimpanzees (Pan troglodytes) Conceal Visual and Auditory Information from Others. Journal of Comparative Psychology, 120(2): 154–62.

Penn, Derek C. (2011). How folk psychology ruined comparative psychology and what scrub jays can do about it. In R. Menzel & J. Fischer (Eds.), Animal Thinking: Contemporary Issues in Comparative Cognition (pp. 253–265). Cambridge, MA: MIT Press.

Penn, Derek C., and Daniel J. Povinelli (2007). On the lack of evidence that non-human animals possess anything remotely resembling a ‘theory of mind’. Philosophical Transactions of the Royal Society B-Biological Sciences, 362(1480): 731-744.

Penn, Derek C., & Povinelli, Daniel J. (2009). On becoming approximately rational: The relational reinterpretation hypothesis. In S. Watanabe, A. P. Blaisdell, L. Huber, & A. Young (Eds.), Rational Animals, Irrational Humans. Tokyo, Japan: Keio University Press.

Penn, Derek C., & Povinelli, Daniel J. (2013). The comparative delusion: Beyond behavioristic and mentalistic explanations for nonhuman social cognition. In H. S. Terrace & J. Metcalfe (Eds.), Agency and joint attention. New York, NY: Oxford University Press.

Penn, Derek C., Holyoak, Keith J., & Povinelli, Daniel J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31(2): 109-178.

Povinelli, Daniel J., Nelson, Kurt E., and Boysen, Sarah T. (1990). Inferences about guessing and knowing by chimpanzees (Pan troglodytes). Journal of Comparative Psychology, 104(3): 203-210.

Povinelli, Daniel J., and Timothy J. Eddy (1996). What young chimpanzees know about seeing. Monographs of the Society for Research in Child Development, 61(3): 1-152.

Povinelli, Daniel J., & Vonk, Jennifer (2006). We don’t need a microscope to explore the chimpanzee’s mind. In S. Hurley & M. Nudds (Eds.), Rational Animals? (pp. 385–412). New York, NY: Oxford University Press.

Premack, David, and Guy Woodruff (1978). Does the Chimpanzee Have a Theory of Mind? Behavioral and Brain Sciences, 1(4): 515-526.

Sani, Fabio and John Todman. (2006). Experimental Design and Statistics for Psychology: A First Course. Oxford: Blackwell Publishing.

Shettleworth, Sara J (2010). Clever animals and killjoy explanations in comparative psychology. Trends in Cognitive Sciences, 14(11): 477-481.

Sober, Elliott (2001). The principle of conservatism in cognitive ethology. Royal Institute of Philosophy Supplement, 49: 225-238.

Staley, Kent (2008). Error-statistical elimination of alternative hypotheses. Synthese, 163: 397-408.

Tomasello, Michael, and Josep Call (2006). Do Chimpanzees Know What Others See—or Only What They Are Looking At? In Rational Animals?, ed. Susan Hurley and Matthew Nudds, 371–384. Oxford: Oxford University Press.

Worrall, John (1989). Fresnel, Poisson and the white spot: The role of successful predictions in the acceptance of scientific theories. In David Gooding, Trevor Pinch, and Simon Schaffer (Eds.), The uses of experiment: Studies in the natural sciences. Cambridge, UK: Cambridge University Press.




[1] A previous version of this paper was presented to the Cambridge Comparative Cognition Lab. Thanks to the members of that group for their helpful feedback and discussion, especially Lucy Cheke, Nicky Clayton, Ed Legg, Corina Logan, and Ljerka Ostojic. A very special thanks to Kristin Andrews, Irina Mikhalevich, and Robert Lurz for agreeing to comment on this paper for Minds Online—thank you!

[2] A somewhat similar distinction has been made between “romantics” and “killjoys” and between “boosters” and “scoffers” (Dennett 1983; Tomasello and Call 2006; Shettleworth 2010).

[3] I borrow this term from Mayo 1996 (see below).

[4] See Sober 2001, Lurz 2011 (especially section 2.7), and Meketa 2014 for discussion of the latter claim.

[5] See Premack and Woodruff 1978, Povinelli et al. 1990, and Povinelli and Eddy 1996 for pioneering work in this area.

[6] See Sani and Todman 2006 and Dienes 2008 for general introductions to experimental design.

[7] See Mayo 1996 and Staley 2008. Sani and Todman 2006 also refer to the former as a “theory” and the latter as a “testable hypothesis.”

[8] I use “confirm” loosely here to mean, “judged to be supported by the data.”

[9] One can formally evaluate the severity of a test by calculating the probability of obtaining a given set of data on the assumption that the hypothesis being tested is false. However, I will undertake a more qualitative analysis here. A formal analysis is not always possible with high-level hypotheses and the particular statistical methods used to evaluate experimental hypotheses in psychology often vary from experiment to experiment. Given this, a general, qualitative analysis is more appropriate and, as we will see, adequate for evaluating the claims being considered here.
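The formal calculation gestured at in footnote [9] can be illustrated with a toy example. The numbers below are hypothetical (they are not from any study discussed in the paper): suppose a subject passes 16 of 20 trials, and we ask how probable data that good would be if the hypothesis under test were false and the subject were merely responding at chance.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical result: 16 successes in 20 trials. If the hypothesis
# under test were false and the subject were responding at chance
# (p = 0.5), data this good would be very improbable...
p_if_false = prob_at_least(16, 20, 0.5)   # roughly 0.006

# ...so the severity of a passing result is high: the test would
# almost certainly have detected the hypothesis's falsity.
severity = 1 - p_if_false                 # roughly 0.994
print(f"P(data | H false) = {p_if_false:.4f}, severity = {severity:.4f}")
```

As the footnote notes, such a formal analysis requires that the falsity of the hypothesis pin down a definite chance model; for high-level hypotheses like mindreading no single such model exists, which is why the paper’s analysis remains qualitative.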

[10] Recall that the independent variable does not vary within a condition, so all within-condition variance should be due to randomized nuisance variables.

[11] A worry here is that positive experimental results in psychology (and science generally) are over-represented because of the unwillingness of journals to publish negative results. I will set this worry aside here, but it is a legitimate one—thanks to Lucy Cheke for flagging it.


  1. The question of whether some nonhuman animals understand that others have a mental life of their own (what is known as mindreading or theory of mind) has received a great deal of theoretical attention in recent decades in comparative psychology and philosophy. And, yet, the answer seems farther now than ever, with some scientists and philosophers now worrying that the question may be empirically intractable. Halina disagrees with this pessimistic conclusion, calling its proponents methodological sceptics and arguing that the scepticism rests on a misunderstanding of the relationship between experiment and evidence. This misunderstanding arises from what she calls the “alternative-hypothesis objection” to mindreading, which consists in offering reinterpretations of experimental data on the assumption that these reinterpretations undermine the original inference by offering alternative hypotheses that account for the results as well as the mindreading hypothesis. The trouble with this approach, Halina argues, is that such reinterpretations are merely empirically adequate – they only fit the data – but fit with data is not enough to produce genuine underdetermination of a theory by the evidence. She is right: underdetermination cannot be produced by just any hypothesis – it must have some degree of plausibility, or else hypotheses that rely on telepathy or alien abductions could undermine well-tested, naturalistic hypotheses, which would surely be absurd. Halina enlists Deborah Mayo’s principle of “severe testing” as a requirement for evidence from experiment, on which data count as evidence for a hypothesis only when the methodology could have proven the hypothesis false were it in fact false. In the case of mindreading experiments, the alternatives fail the severity analysis – in part because the tests were not designed to test them. Halina concludes that these alternatives could only undermine the evidentiary status of the mindreading hypotheses if they had independent empirical support, which they currently lack.

    I am sympathetic with many of her conclusions, but I would like to argue that Halina’s emphasis on and interpretation of the “alternative hypothesis objection” misrepresent the sceptical project, with consequences for her suggested corrective.

    First, few sceptics are guilty, as she claims, of assuming an overly simplified view of underdetermination on which the mere presence of competing hypotheses undermines the evidentiary strength of the hypothesis being proposed. Consider Halina’s two main targets – Heyes’s (2014) “submentalizing” hypothesis and Povinelli and colleagues’ “behaviour-reading” hypothesis. Heyes writes that if her account is correct, then “…it suggests that… humans do not need mentalizing as much as previously thought” (Heyes 2014, 132; emphasis added). As I have written elsewhere (Meketa 2014), this locution of not needing to posit X when Y can explain the phenomena suggests that the burden of proof is on X. If Heyes is comfortable making burden of proof claims, charity requires that we seek out her reasons and address these directly. Povinelli and colleagues, who argue that mindreading does no explanatory work over and above that of behaviour-reading, similarly hold mindreading to bear the burden of proof. This is in part because, on the sceptical view, mindreading is a species of more “complex” hypothesis, and simpler hypotheses such as behaviour-reading enjoy the privilege of default status whenever a more complex hypothesis is proffered. Yet other sceptics, such as Karin-D’Arcy (2005), appeal to evolutionary theory to argue that more complex hypotheses bear the burden of proof.

    Thus, the sceptics are not simply offering empirically adequate alternatives as a strategy for unseating the mindreading hypothesis. Rather, they take their hypotheses as the theoretical defaults that any mindreading experiment must dislodge. On this sceptical view, then, if a mindreading experiment is incapable of ruling out the non-mindreading hypothesis – as the severe testing procedure demands – then this is a problem for the experiment and not for the “alternative” hypothesis. If they are right, then the sceptics do not need to show that their hypotheses can pass with severity since their hypotheses already and independently enjoy the status of epistemic superiority. The appeal to simplicity is not, then, a reason to prefer the alternative hypothesis to a particular mindreading hypothesis in a given experiment, but a reason to treat all instances of such “simpler” alternatives as the defaults. Indeed, Halina notes that sceptics often appeal to such things as the virtue of simplicity when arguing for the superiority of their alternatives (Halina 2015, 2). This captures the dialectic, but not the underlying strategy of the sceptics, and so she cannot, as she does, simply set the question of the plausibility of such appeals aside.

    Given this, what work is left for the severity analysis to do? One suggestion is for Halina to further develop her Popperian idea that high-level hypotheses such as “mindreading” are partially corroborated whenever they survive risky tests – that is, whenever their predictions pass “severe” experimental tests. To give a complete picture, it may be helpful to include a discussion of other means by which these high-level hypotheses are corroborated – namely, through their conformity with established theories in related domains, such as evolutionary biology. Such a discussion would offer the resources to explain when and under what conditions certain types of high-level hypothesis rightly attain the status of the default hypothesis. If Halina can show that the “simpler” sceptical hypotheses are not entitled to this status, then these alternative hypotheses would, indeed, fail to undermine the mindreading experimental inferences on the analysis she offers. This is because theoretical justifications such as appeals to simplicity do not ground specific high-level hypotheses such as behaviour-reading, but restructure research programs by, e.g., requiring all mindreading-related experiments to be capable of ruling out “simpler” alternatives or fail to count as evidence for mindreading. Shifting the burden of proof away from the mindreading hypotheses would block this sceptical move, allowing experiments to count as evidence for mindreading even when they are empirically insensitive to, and hence cannot rule out, “simpler” alternative explanations.

    Insofar as Halina examines the statistical basis for inference from experiment in comparative psychology, argues that the traditional default hypotheses are not supported by the evidence, and concludes that more empirical support is necessary for the sceptical project to go through, her project is consonant with and enriches those of other philosophers who find flaws with the methodology of the sceptical project (e.g., Andrews and Huss (2014), Fitzpatrick (2009), Meketa (2014), Mikhalevich (forthcoming), and Sober (2009)). Her novel contribution lies in bringing the severe testing requirement to bear on hypothesis choice in the mindreading debate. This insight is important and may be profitably extended to other areas of experimental comparative psychology.


    Andrews, Kristin and Brian Huss. 2014. “Anthropomorphism, Anthropectomy, and the Null Hypothesis.” Biology and Philosophy 29:711-29

    Fitzpatrick, Simon. 2009. “The primate mindreading controversy: a case study in simplicity and methodology in animal psychology.” In The Philosophy of Animal Minds, edited by Robert W. Lurz, 237 – 257. New York: C.U.P.

    Halina, Marta. (In preparation). Inference and Error in Comparative Psychology: The Case of Mindreading

    Heyes, Cecilia. M. 2014. Submentalizing: I’m not really reading your mind. Perspectives on Psychological Science, 9: 131-143.

    Karin-D’Arcy, M. 2005. The Modern Role of Morgan’s Canon in Comparative Psychology. International Journal of Comparative Psychology, 18(3).

    Mayo, Deborah & Aris Spanos. 2011. “Error Statistics,” In Handbook of the Philosophy of Science, Volume 7: Philosophy of Statistics, ed. Prasanta S. Bandyopadhyay and Malcolm R. Forster, 153-198. Philadelphia: Elsevier Inc.

    Meketa, Irina. 2014. A critique of the principle of cognitive simplicity in comparative psychology. Biology & Philosophy, 29: 731-745.

    Mikhalevich, Irina (forthcoming) Experiment and Animal Minds: Why the Choice of the Null Hypothesis Matters. Philosophy of Science.

    Penn, Derek C., and Daniel J. Povinelli. 2007. On the lack of evidence that non-human animals possess anything remotely resembling a ‘theory of mind’. Philosophical Transactions of the Royal Society B-Biological Sciences, 362(1480): 731-744.

    Povinelli, Daniel J., & Vonk, Jennifer. 2006. We don’t need a microscope to explore the chimpanzee’s mind. In S. Hurley, & M. Nudds (Eds.), Rational Animals? (pp. 385–412). New York, NY: Oxford University Press.

    Sober, Elliott. 2009. “Parsimony and Models of Animal Minds.” In The Philosophy of Animal Minds, ed. Robert W. Lurz, 237-257. New York: C.U.P.

  2. I’m delighted to have been invited to comment on Marta Halina’s very interesting and original paper “Inference and error in comparative psychology: The case of mindreading.” The animal mindreading debate has been something I’ve been thinking and writing about for some years now, and I am thrilled to be given the opportunity to think through an exciting new idea about the debate and how it should be settled. I would like to start by first providing a rather basic description of the difference between the animal mindreading and animal behavior-reading hypotheses, and then move on to examine some aspects of Halina’s view.

    For countless species of nonhuman animal (hereafter just ‘animal’), the ability to predict the behavior of other animals is vital to their wellbeing and reproductive success. A defining question in animal social cognition research is: how do animals make such predictions? There are two very general and opposing hypotheses in the field that attempt to answer this question. According to the behavior-reading hypothesis, all animals that are capable of predicting the behavior of others do so by means of perceptual and cognitive processes that range over non-mentalistic representations of behavioral and environmental cues and relations. Some of these cues and relations can be rather specific, such as ‘torso facing forward’ or ‘hair bristling’, while others can be more abstract, such as ‘threat display’, ‘line of gaze’, or ‘manipulating an object in the most efficient way within the constraints of the setting’. What makes these representations of such behavioral and environmental cues and relations non-mentalistic is that the animal can represent them as such without having any understanding of the mental states that may be causing or associated with them in other agents or themselves. In contrast, the mindreading hypothesis holds that some species of animal (e.g., apes, corvids, dogs, and dolphins) that are capable of predicting the behavior of other animals do so by representing and reasoning over mental states. These mental states can include perceptual states such as seeing and hearing; affective and motivational states such as fearing, desiring, intending, and willing; and cognitive states such as knowing, believing, and inferring. A mindreading animal, for example, might predict aggressive behavior from a conspecific because it understands that the threat display of the conspecific means that the conspecific feels threatened and is likely to attack. An animal that makes the same prediction of aggressive behavior but without understanding the threat display as meaning something about how the conspecific feels is a behavior-reader.

    In recent years, a number of mindreading experiments have yielded data that are consistent with the animal mindreading hypothesis (for reviews, see Call & Tomasello, 2008; Lurz, 2011). Some researchers have taken the data from these experiments as providing “solid evidence” for the mindreading hypothesis and against the behavior-reading hypothesis (Call & Tomasello, 2008, p. 187). Other researchers (the ‘skeptics’) point out that the data can be equally well accounted for on the behavior-reading hypothesis and, as a result, do not provide any more evidence for the mindreading hypothesis than for the behavior-reading hypothesis (Heyes, 2014; Lurz, 2009; Lurz, Kanet & Krachun, 2014; Lurz & Krachun, 2010; Penn & Povinelli, 2007; Povinelli & Vonk, 2003).

    Marta Halina argues that the skeptics’ reasoning rests upon a mistaken view of evidence. The correct view, according to Halina, is Deborah Mayo’s error-statistical view of evidence (Mayo, 1996). On this view, data that ‘fit’ a hypothesis are evidence for it just in case the data are produced by severe testing procedures, where severe testing procedures for a hypothesis, according to Halina, are procedures “capable of detecting whether that hypothesis is false when it is in fact false.” Halina argues that the studies that have yielded data consistent with the animal mindreading hypothesis employ severe tests for the mindreading hypotheses but not for the behavior-reading hypothesis and, as a result, the data from these tests provide evidence for the former hypothesis but not the latter hypothesis, contrary to what skeptics claim.

    So what are skeptics, such as myself, to say? Well, one thing that might be pointed out is that Mayo’s error-statistical view is not the only credible view of evidence around. There are other credible views of evidence in which severe tests are not required for data to count as evidence for a hypothesis (e.g., the positive-relevance account, Achinstein’s account, Glymour’s bootstrapping account, etc.). Skeptics are within their rights to ask for some compelling reason to accept the error-statistical view over other equally credible views of evidence that do not require severe tests. Halina’s argument against the skeptics, therefore, would be stronger and more convincing if there were such a reason. However, I understand that providing such a reason lies outside the scope of Halina’s paper. So it is unfair to ask for it here. For the sake of argument, then, I will accept the error-statistical account of evidence as correct.

    Taking the error-statistical account of evidence as correct, I now want to look more closely at Halina’s claim that the tests that have yielded data consistent with the mindreading hypothesis are severe for that hypothesis but not for the behavior-reading hypothesis. I cannot, of course, look at all of these tests; instead, I would like to look at one test in particular that I think is highly representative of the rest.

    Perhaps the best-known and most influential mindreading test that has yielded data consistent with the animal mindreading hypothesis is Hare, Call, Agnetta, and Tomasello (2000). In Hare et al.’s (2000) study, a subordinate and a dominant chimpanzee faced off over two pieces of food placed in a central cage. One of the pieces of food was hidden on the subordinate’s side of an opaque barrier, allowing the subordinate but not the dominant to see the food, while the other piece of food was out in the open, allowing both chimpanzees to see the food. Hare and colleagues found that subordinate chimpanzees were more likely to obtain the food behind the opaque barrier than the food out in the open, and they were more likely to make their first move toward the food behind the opaque barrier than the food out in the open. Hare and colleagues interpret these results as evidence that chimpanzees understand the behavioral significance of seeing in conspecifics – that during their ontogeny subordinate chimpanzees learn that dominant conspecifics are more likely to go for food that they see than food they do not see. Hare and colleagues’ seeing hypothesis is an example of a mindreading hypothesis, since seeing is a state of awareness, and awareness is a mental state. Other researchers, myself included, have pointed out that the findings from Hare et al. (2000) are just as plausibly explained on the hypothesis that chimpanzees understand the behavioral significance of line of gaze in conspecifics – that during their ontogeny subordinate chimpanzees learn that dominant conspecifics are more likely to go for food in their line of gaze than food behind opaque barriers (Heyes, 2014; Lurz, 2009; Penn & Povinelli, 2007; Perner, 2008; Povinelli & Vonk, 2003; Whiten, 2013). The line-of-gaze hypothesis is an example of a behavior-reading hypothesis, since line of gaze is a (non-mental) spatial relation between an agent’s eyes and objects in the agent’s environment.

    On Halina’s view, the data from Hare et al.’s (2000) study provide evidence for the seeing hypothesis but not the line-of-gaze hypothesis because the testing procedures used were severe for the former, but not for the latter, hypothesis. Recall that severe testing procedures for a hypothesis are those that are “capable of detecting whether that hypothesis is false when it is in fact false.” Applying this idea to Hare et al.’s (2000) study, we get the view that the testing procedures in Hare et al. (2000) were such that

    (a) had the seeing hypothesis been false, the chimpanzees would have likely behaved in ways inconsistent with what the seeing hypothesis predicts; but
    (b) had the line-of-gaze hypothesis been false, the chimpanzees would have likely behaved in ways consistent with what the line-of-gaze hypothesis predicts.

    Although Halina does not speak directly about Hare et al.’s (2000) study, she says things that imply that, on her view, (a) and (b) are correct, making the data from Hare et al.’s (2000) study evidence for the seeing, but not the line-of-gaze, hypothesis. However, the application of Halina’s view to Hare et al.’s (2000) study creates a problem. The problem, simply stated, is that the condition under which (a) is correct is the very condition under which (b) is incorrect. To see this, it’s important to note that there are two ways for the seeing hypothesis to be false. One way the hypothesis could be false is if the line-of-gaze hypothesis were true. If subordinate chimpanzees predict the dominant’s behavior simply on the basis of what food the dominant has or does not have a line of gaze to, then it would be false that they predict the dominant’s behavior on the basis of what food the dominant sees or does not see. But if the seeing hypothesis were false because the line-of-gaze hypothesis were true, then subordinate chimpanzees would behave in exactly the way the line-of-gaze hypothesis predicts – that is, they would prefer to take the food behind the opaque barrier to which the dominant lacks a line of gaze. And this is also exactly what the seeing hypothesis predicts the subordinates would do on Hare et al.’s (2000) test. Hence, if the seeing hypothesis were false because the line-of-gaze hypothesis were true, then the subordinate chimpanzees would not have behaved in ways inconsistent with what the seeing hypothesis predicts. Thus, (a) is incorrect if the seeing hypothesis is false as a result of the line-of-gaze hypothesis being true.

    The other way that the seeing hypothesis could be false is if the line-of-gaze hypothesis were also false. That is, the seeing hypothesis could be false because the subordinate chimpanzees were using neither seeing nor line-of-gaze to predict the dominant’s behavior. If the seeing hypothesis were false in this way, then the subordinate chimpanzees may well behave in ways inconsistent with what the seeing hypothesis predicts. For example, the subordinate chimpanzees may have ended up showing no significant preference for the food behind the occluder, or showing a significant preference for the food out in the open. But either of these results would also be inconsistent with what the line-of-gaze hypothesis predicts. Hence, had the line-of-gaze hypothesis been false under this sort of condition (i.e., when the seeing hypothesis was also false), Hare et al.’s (2000) test would have detected it, which means (b) is incorrect. Either way, there does not seem to be a condition in which the seeing hypothesis is false and (a) and (b) are both correct. Thus, either Hare et al.’s (2000) test is not severe for the seeing hypothesis; or it is severe for the seeing hypothesis, but under the condition in which it is severe for the seeing hypothesis, it is also severe for the line-of-gaze hypothesis.

    I believe that all the other mindreading tests that have yielded data consistent with the animal mindreading hypothesis succumb to the same sort of problem. In other writings, I have tried to show that in these other mindreading tests there is an antecedently plausible behavior-reading hypothesis (such as the line-of-gaze hypothesis) that is also consistent with the data. What are needed are more sensitive tests in which the mindreading hypothesis and the antecedently plausible behavior-reading hypothesis make different predictions about how the animals will perform on the test. Such tests are difficult to design, which perhaps explains why they have not yet been run. However, I am optimistic that such tests can be designed and run (see Lurz, 2011; Lurz, Kanet & Krachun, 2014; Lurz & Krachun, 2011).

    In the end, I remain skeptical about whether the data from Hare et al.’s (2000) study, or the data from any other mindreading study with which I am familiar, is unequivocal evidence for the animal mindreading hypothesis.


    Barth, J., Reaux, J. and Povinelli, D. (2005). Chimpanzees’ (Pan troglodytes) use of gaze cues in object-choice tasks: different methods yield different results. Animal Cognition, 8, 84-92.

    Call, J. and Tomasello, M. (2008) Does the chimpanzee have a theory of mind? 30 years later. Trends in Cognitive Sciences, 12, 187–92.

    Hall, K. et al. (2014). Using cross correlations to investigate how chimpanzees (Pan troglodytes) use conspecific gaze cues to extract and exploit information in a foraging competition. American Journal of Primatology, 76, 932-941.

    Hare, B., Call, J., Agnetta, B. and Tomasello, M. (2000). Chimpanzees know what conspecifics do and do not see. Animal Behaviour, 59, 771–85.

    Heyes, C. (2014). Animal mindreading: What’s the problem? Perspectives on Psychological Science, 9, 131-143.

    Lurz, R. (2009). If chimpanzees are mindreaders, could behavioral science tell? Toward a solution of the logical problem. Philosophical Psychology, 22, 305-328.

    Lurz, R. (2011). Mindreading animals: The debate over what animals know about other minds. Cambridge, MA: MIT Press.

    Lurz, R. and Krachun, C. (2011). How could we know whether nonhuman primates understand others’ internal goals and intentions? Solving Povinelli’s problem. Review of Philosophy and Psychology, 2, 449–81.

    Lurz, R., Kanet, S. and Krachun, C. (2014). Animal mindreading: A defense of optimistic agnosticism. Mind & Language, 29, 428-454.

    Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago, IL: The University of Chicago Press.

    Okamoto-Barth, S., Call, J. & Tomasello, M. (2007). Great apes’ understanding of other individuals’ line of sight. Psychological Science, 18, 462-468.

    Penn, D. and Povinelli, D. (2007). On the lack of evidence that animals possess anything remotely resembling a ‘theory of mind.’ Philosophical Transactions of the Royal Society B, 362, 731-744.

    Perner, J. (2008). Who took the cog out of cognitive science? Mentalism in an era of anti-cognitivism. In P. A. Frensch and R. Schwarzer (eds), International Congress of Psychology: 2008 Proceedings. Hove: Psychology Press, 241–61.

    Povinelli, D. and Vonk, J. (2003). Chimpanzee minds: suspiciously human? Trends in Cognitive Science, 7, 157-160.

    Whiten, A. (2013). Humans are not alone in computing how others see the world. Animal Behaviour, 86, 213-221.


    1 I take Heyes’ submentalizing hypothesis to be a type of behavior-reading hypothesis.

    2 Perhaps the animal has simply experienced threat displays of this sort followed by attacking behavior in the past; or perhaps the ‘threat display → attacking behavior’ sequence is an instinctive behavior pattern in the species that the animal understands innately.

    3 Halina appears to provide a different definition of ‘severe testing procedures for a hypothesis’ later in her paper when she writes that a “hypothesis has been severely tested when an experiment is internally valid and the results are statistically significant.” On this definition, however, it is not clear why a behavior-reading hypothesis (e.g., line-of-gaze hypothesis) is not severely tested by an internally valid mindreading experiment (e.g., Hare et al.’s (2000) experiment) with statistically significant results that are consistent with the behavior-reading hypothesis. In reply, Halina writes that the fact that a mindreading experiment is internally valid and produces statistically significant results consistent with a behavior-reading hypothesis does not show that the behavior-reading hypothesis was severely tested by the experiment “because a negative result would not count against [the behavior-reading hypothesis].” This reply, however, simply appeals to the first definition of ‘severe testing procedures for a hypothesis.’ Thus, I will assume that by ‘severe testing procedures for a hypothesis,’ Halina means a testing procedure capable of detecting that the hypothesis is false when it is in fact false.

    4 A number of studies have shown that chimpanzees follow other subjects’ line of gaze to distal objects, and that they can use line of gaze in humans and conspecifics to find hidden food (Barth, Reaux & Povinelli, 2005; Hall et al., 2014; Okamoto-Barth, Call & Tomasello, 2007). So it is antecedently plausible to suppose subordinate chimpanzees use line of gaze in Hare et al.’s (2000) study to predict the dominant’s behavior.

    5 Line of gaze is a spatial relation between objects and agents. Roughly, an agent has a line of gaze to an object just in case the object is in front of the agent’s eyes and not behind an opaque barrier. Seeing an object involves more than simply having a line of gaze to it. This is evident from the fact that we do not see everything that is in our line of gaze (e.g., dust, ultraviolet light, glass doors, white dots on white backgrounds, etc.).

    6 Halina says things that imply that she believes that such “negative results would not count against” the line-of-gaze hypothesis. However, I am afraid I do not understand this. The line-of-gaze hypothesis predicts a certain behavior pattern from the subordinate chimpanzees, just as the seeing hypothesis does. In fact, they both predict the same behavior pattern from the subordinate chimpanzees. If the subordinate chimpanzees do not behave in the way predicted, if they show no significant preference for the food behind the opaque barrier or show a significant preference for the food out in the open, then those negative results would count as much against the seeing hypothesis as the line-of-gaze hypothesis.

  3. The question at stake in Halina’s paper is whether the experimental data offers evidence for the mindreading hypothesis, but not alternative hypotheses (for simplicity’s sake, but with reluctance, I will focus on the behavior-reading alternative). While Halina argues that it does, because the research serves as a severe test for the mindreading hypothesis but not for the behavior-reading hypothesis, I’m not convinced. I have argued that the problem of deciding between these two hypotheses may be the problem of other minds in sheep’s clothing (Andrews 2012), and that worry remains.

    As I understand it, Halina’s central argument goes something like this:

    1. A hypothesis has been subjected to a severe test when an experiment is internally valid and the results are statistically significant.
    2. The experiments testing the mindreading hypothesis are internally valid–because they controlled for nuisance variables–and are statistically significant.
    3. Therefore, the mindreading hypothesis has been subjected to a severe test.
    4. The behavior-reading hypothesis has not been subjected to a severe test, because it is compatible with both passing and failing the tests.
    5. Data is only good evidence for hypotheses subjected to a severe test.
    6. Therefore, the mindreading data is only good evidence for the mindreading hypothesis.

    Let’s grant P5 for the sake of discussion, even though it appears overly strong.

    P4 is also worth noting, not because it is false, but because the mindreading hypothesis is also compatible with both passing and failing certain mindreading tests. Failures to elicit mindreading behavior may be due to the motivation of the subject(s), ecological validity, perceptual capacities, complexity of the test, biases, etc. When pilot testing, these independent variables are often manipulated in order to elicit the desired behavior.

    Furthermore, those who think that infants mindread do not take infants’ failures to pass the verbal false belief task as evidence against infant mindreading; infant mindreading is compatible with their failing the classic false belief task. But, importantly, one has to fail the right sort of tasks, in the right sort of way. It isn’t that “proponents of these alternatives count [any] such negative results as evidence for their views (Penn & Povinelli 2007, Heyes 2015)” (7) or that “both the positive and negative results of mindreading experiments are consistent with behaviour reading” (7). Indeed, Penn and Povinelli make it clear that they are looking for something like signature limits in their attention to both successes and failures when they write, “it is the pattern of successes and failures on different conditions in our protocol that is likely to provide the most interesting evidence concerning the cognitive strategy being employed by a given non-human subject” (2007, 741). Careful attention to the patterns of failures and successes has helped us understand the shape of the hidden mechanisms at work in cognitive systems when it comes to numerosity and analogue magnitudes in children and animals, for example (see Gallistel 1990, Carey 2009, Beck 2012), and the same strategy is at play in investigating the shape of the mechanisms involved in social prediction (Apperly and Butterfill 2009, Butterfill and Apperly 2013, Brown and Taylor, in prep). The reason why the behavior-reading hypothesis hasn’t been subjected to a severe test isn’t because it is compatible with failing some of the tests designed to examine the mindreading hypothesis. The reason is that, I’m afraid, neither mindreading nor behavior-reading can be subjected to a very severe test. To present this worry I need to go into some detail about the nature of a severe test, and the close relationship between the hypotheses under discussion.

    Halina defines a severe test as “a testing procedure capable of detecting whether that hypothesis is false when it is in fact false” (2) and notes that “An experimental hypothesis has been severely tested when an experiment is internally valid and the results are statistically significant” (6). Note that, as stated, a severe test is one that serves as a negative test, not a positive test or decision procedure. A severe test is one that will give true negatives, but could also give false positives.

    Note too that Mayo describes a severe test as one that is relative to a particular hypothesis (1997). She gives the example of a diagnostic test of some disease that has a high rate of false positives, and says that getting a negative in such a test is a severe test of the hypothesis that the individual doesn’t have the disease, but getting a positive isn’t a severe test of the hypothesis that the individual does have the disease. A positive result would not constitute passing a severe test in this case.

    Analogously, in the mindreading case, if we have a diagnostic test of the existence of a particular mechanism that has a high rate of false positives and a low rate of false negatives, getting a negative would serve as a severe test of—and good evidence for—the hypothesis that the individual lacks that mechanism. But getting a positive wouldn’t be a severe test of the hypothesis that the individual does have the mechanism.
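The asymmetry in Mayo’s diagnostic-test example can be made concrete with a small arithmetic sketch. This illustration is mine, not from the commentary; the function names and error rates are hypothetical, chosen only to make the asymmetry between positive and negative results vivid:

```python
# Illustrative sketch of Mayo-style severity for a diagnostic test.
# Severity of a result for a hypothesis: the probability the test
# would have produced a *different* result if that hypothesis were false.

def severity_of_positive(false_positive_rate):
    # Severity of a POSITIVE result for "condition present":
    # how often the test would have come out negative
    # if the condition were in fact absent.
    return 1 - false_positive_rate

def severity_of_negative(false_negative_rate):
    # Severity of a NEGATIVE result for "condition absent":
    # how often the test would have come out positive
    # if the condition were in fact present.
    return 1 - false_negative_rate

# A test with many false positives but few false negatives:
false_positive_rate = 0.40
false_negative_rate = 0.02

print(severity_of_positive(false_positive_rate))  # low: a positive result is not severe
print(severity_of_negative(false_negative_rate))  # high: a negative result is severe
```

On these assumed error rates, a negative result constitutes a severe test of “condition absent” (it would rarely occur if the condition were present), while a positive result is a weak test of “condition present,” since it would frequently occur even with the condition absent.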

    Tests are severe or not with regard to a particular hypothesis, e.g., that the disease is present, but that same test need not be severe with regard to the negation of that hypothesis, e.g. that the disease is absent.

    Now, turning to the two hypotheses on the table in the chimpanzee mindreading debate, we have:

    Hb: the mechanism is behavior reading without mindreading.
    Hm: the mechanism is behavior reading plus mindreading.

    We need to formulate the hypotheses in this way given that a mindreader must also be a behavior reader—a mindreader knows what flavor of mental states go along with, or cause, the behaviors that are exhibited by a target, and in virtue of that they are able to categorize behaviors as certain types, and as associated with various other behaviors.

    Halina argues that for there to be evidence in favor of Hb, we need a severe test of Hb, because the mindreading tests designed to test Hm are severe tests for that hypothesis, but not for Hb. A severe test for the hypothesis Hb would be one capable of detecting that the hypothesis is false when it is in fact false. In what ways could this hypothesis be false? It could be false if there is both behavior-reading and mindreading (as Hm would have it). It could be false if there is no behavior reading going on at all (which is also going to make Hm false given that mindreading requires behavior reading; and I have to say I cannot imagine what such a test could possibly look like). So a severe test for the Hb hypothesis that doesn’t also eliminate Hm would also be a severe test for the hypothesis ~Hm.

    However, a severe test for ~Hm would take the form of a mindreading test that requires an independent verification of the rate of false positives for mindreading. But we don’t have a severe test for ~Hm because we just don’t know whether the mindreading tests have a high rate of false positives. Those who argue that there are viable alternative hypotheses would claim that mindreading tests do have a high rate of false positives. If that is right, then, even worse, passing a mindreading test isn’t passing a severe test either. The design of the kind of severe tests necessary to decide between the hypotheses requires the very knowledge that we currently lack.

    Furthermore, the severity criterion doesn’t help to adjudicate the debate between Hm and Hb because within a particular experiment, to be a severe test it is necessary (and sufficient?) that the experiment be internally valid with regard to the hypothesis being tested. Complete internal validity would guarantee that no variable other than the hypothesized one would be implicated in causing the dependent variable. In the case of the debate between Hb and Hm, the perceptual information could be causally implicated in causing the dependent variable in two different ways. Despite our best efforts, no mindreading experiment has been conceived that succeeds in controlling for the variables that causally relate to the Hb hypothesis (Andrews 2015). This is because we cannot control for such a variable on nonverbal tasks, given that a mindreader must read behavior, and the same perceptual information is the relevant variable for the dependent variable when investigating both Hm and Hb. Internal validity is arguably a stronger criterion than a severe test, for it takes into account all the possible causal relations between the independent variables in the experimental context. This worry challenges P2 of Halina’s argument, since if the nuisance variables cannot be controlled for when it comes to the challenge of Hb, then Hm has not been subjected to a severe test either.

    Apperly, I. A., & Butterfill, S. A. (2009). Do humans have two systems to track beliefs and belief-like states? Psychological Review, 116(4), 953–970.

    Andrews, K. (2012). Review of Lurz Mindreading Animals. Notre Dame Philosophical Review. March 30, 2012. http://ndpr.nd.edu/news/29824-mindreading-animals-the-debate-over-what-animals-know-about-other-minds/

    Andrews, K. (2015). The Animal Mind: An Introduction to the Philosophy of Animal Cognition. Abingdon, Oxon: Routledge.

    Beck, J. (2012). The Generality Constraint and the Structure of Thought. Mind, fzs077. 

    Brown, R. and Taylor, in prep.

    Butterfill, S. A., & Apperly, I. A. (2013). How to Construct a Minimal Theory of Mind. Mind & Language, 28(5), 606–637.

    Carey, S. (2009). The Origin of Concepts (Reprint edition). Oxford: Oxford University Press.

    Gallistel, C. R. (1990). The Organization of Learning. Cambridge, Mass.: A Bradford Book.

    Mayo, D. (1997). Severe tests, arguing from error, and methodological underdetermination. Philosophical Studies 86: 243–266.

    Penn, D. C., & Povinelli, D. J. (2007). On the lack of evidence that non-human animals possess anything remotely resembling a “theory of mind.” Philosophical Transactions of the Royal Society B, 362, 731–744.

  4. This is a great discussion! Thanks Marta for offering your paper up for debate!

    A few thoughts from an empiricist…
    If interpreting results based on behavioral tests does not allow us to determine which hypothesis is supported (e.g., mind reading or behavior reading through line of gaze following), then we need additional and different types of data that might allow us to support one hypothesis and not the other. Looking at brain activity while individuals are performing these behavioral tests seems a good way forward since what we ultimately want to know is whether they imagine another’s mental state.

    For that matter, how do we know that humans actually mindread rather than just behave like we do based on previous learned associations (e.g., I have learned that when your face scrunches up like that, it is often followed by yelling, therefore when I see face scrunching in the future I predict that yelling will happen)? Is there actual evidence? When I went looking for data on what parts of the human brain are active when humans imagine something, I found that there weren’t actually any studies that investigated this (http://journal.frontiersin.org/…/fpsyg.2014.00305/full). We tend to assume that we know certain things about humans, but I think they are often just assumptions.

    Also, if I remember correctly, Hare et al.’s study did not set out to test the line of gaze hypothesis, only the mindreading hypothesis. Marta’s point about not applying post-hoc explanations to results that were obtained by testing a different hypothesis is logical and still holds: Different experiments would need to be designed to distinguish between these two hypotheses.

    I’m looking forward to tracking the discussion as it continues!

  5. Hi,

    Thank you for a very interesting presentation.

    I’d like, if I may, to make two points.
    Firstly, it is logically impossible for a nonverbal to read anything, let alone a mind. Reading is a verbal skill and as such it is by definition beyond the capacity of all nonverbals. It may be objected that I am being a stickler for precision here. Perhaps so. But when was it ever wise for researchers to be imprecise with their tools?

    The question under consideration is whether nonverbals are capable of interpreting the thoughts and/or intentions of other agents. What does it mean to interpret something? To interpret an object is to ascribe a use or meaning to it. Without interpretation, a stone is just a bundle of properties. But the uses we tool-users can apply to a stone transform it into any number of different tools. This capacity to attribute a use to an object is such a sophisticated skill that we see it only in the most intelligent creatures.

    Attributing a use to an object is one thing but attributing meaning to an object is necessarily a far more sophisticated skill. And if nonverbals are capable of attributing psychological states to other agents (of predication in fact) then they should already be capable of attributing meaning to objects. In short, they should be capable of language. It is for this reason that I side with the skeptics, even though I’m keen to see if there might be an experiment that will prove me wrong.

    1. Hi Jim,
      Thanks for joining the discussion. I would suggest here that everybody gets a free pass on their preferred terminology, because none of the alternatives are really ideal. The main established alternative terminology is “theory of mind”, and many people prefer “mindreading” because they don’t want to prejudge that the structure of the relevant faculty be theory-like. “Mindreading” is not ideal for the reasons you suggest, and also because it implies some psychic abilities. Given that there are already these two fairly established options in the literature, a new proposed replacement is unlikely to gain much traction. At any rate, everyone who works in this area has gone through their dark night of the soul with the available terminology and has chosen what they take to be the lesser evil. The important thing is that everyone sufficiently explicates the terms they choose to use so that others know what they mean. Unless there’s some critical unclarity with Halina’s use of the term “mindreading”, then, I suggest that it’s better we focus on the substance of the argument in the ensuing discussion.

      1. Thanks Cameron,
        My point about terminology was the first of two observations. My hope was that it would be seen as the lesser of the two.
        The main point I was seeking to raise concerned the skill of attribution. Language users are very adept at the attribution of abstract concepts to objects and agents, but the question of whether nonverbals possess any competence in such skills has – I would suggest – to be viewed in the context of the more rudimentary skills – not to mention the practices – upon which psychological predication must necessarily rely.
        Anyway, I don’t want to derail the discussion, just to clarify that my focus was directed less at terminology than the question of the nature of our skills of attribution and their reliance upon more fundamental practices of tool use and exchange.

  6. First, thanks again to my commentators for their excellent feedback! They raise many important points. I will begin here by responding to three general ones.

    First, I frame my critique of behaviour reading in terms of severe tests. Andrews and Lurz rightly point out that this approach is limited in that it relies on a particular account of evidence (the error-statistical approach). This is indeed how I present it in the paper, but I agree that it would be better to frame it more generally. As Andrews’ comment brings out, my point really just concerns an experiment’s ability to produce true negatives and avoid false positives or, in other words, produce a negative result when the hypothesis is in fact false. My claim is that when an experiment is unable to do this with respect to a given hypothesis H, then the positive results produced by that experiment are not good evidence for H. If readers will grant me this, then that is all I need in order to set up the main criticism presented in the paper.

    Concerning that criticism, it holds that mindreading (MR) experiments are designed to maximize the chances of producing a negative result when the MR hypothesis is false. This is done in various ways: making surprising, specific, previously untested predictions and then testing them, controlling for extraneous variables that might produce the predicted effect, etc. I agree with Andrews and Lurz that no experiment is infallible. One cannot control for all variables and errors are always possible. However, MR researchers engage in practices that attempt to minimize the chances of such errors. Insofar as they do not, they also run into problems.

    This is in contrast to advocates of behaviour reading (BR) who do not conduct BR experiments, but rather rely on the results of MR experiments for evidential support. In the past, BR theorists have taken both the positive and negative results of MR experiments as evidence for their hypothesis. However, doing this means that these experiments have no ability to detect whether BR is false – a negative BR result in this case is simply not an option.

    Andrews and Lurz, however, argue that this is not the best way to conceive of BR. If I understand correctly, they hold that the content of the BR hypothesis includes all and only the positive results of the MR experiments that have been conducted thus far (specifically, rules linking the independent and dependent variables used in those experiments). However, this seems problematic for at least two reasons. First, if this is the case, does a negative result on a newly conducted visual perspective taking task count against BR? If MR researchers became convinced on the basis of such a negative result that chimpanzees do not have level-1 visual perspective taking abilities, would we also conclude that chimpanzees are not behaviour readers? This seems unlikely. Insofar as it is, I am not convinced that the negative results of an MR experiment would count against the BR hypothesis.

    Sceptics might say here that a negative result would not count against the organism having general behaviour-reading abilities, but would count against a particular rule or set of rules (such as competitors will behave in x and y ways when they have a direct line of gaze to food). This is how I understand Lurz’s line-of-gaze hypothesis. But then the question is how do sceptics decide which set of rules to include in their hypothesis? If they make this decision on the basis of the positive results of MR experiments (e.g., drawing on the results of level-1 visual perspective taking experiments in order to construct the line-of-gaze hypothesis), then they run into the original problem. MR experiments are incapable of producing negative BR results because the BR hypothesis simply consists of all and only the positive results of MR experiments.

    Andrews and Mikhalevich make the crucial point that the BR hypothesis is treated by sceptics as the default – as a null hypothesis that we accept when we fail to have evidence for mindreading. This makes sense of the sceptics’ claim that one must exclude all possible BR explanations of a positive MR experimental result before claiming that one has evidence for MR. However, there are two problems with this. First, it misconstrues the MR hypothesis. MR researchers do not hold that mindreaders are telepathic and I think this is what is required in order to block the possibility of giving a BR explanation for a MR result (perhaps another way of putting Andrews’ point that this is the problem of other minds in sheep’s clothing). Instead, mindreading is a cognitive mechanism that involves (among other things) the ability to categorize disparate behaviours into the same abstract class based on the unobservable cognitive state that they have in common.

    Second, I disagree that the MR hypothesis bears the burden of proof for rejecting behaviour-reading alternatives. Neither the MR nor the BR hypotheses are epistemically privileged to begin with, I hold, but must gain acceptance through consideration of their empirical and theoretical virtues. I am not explicit about this in the paper and agree with Mikhalevich that my main argument depends on this claim, so I cannot simply set it aside as I do there. Elsewhere I argue that although sceptics present the behaviour-reading alternative as an epistemically privileged “null” in precisely the way that Mikhalevich describes, there is no justification for such privileging. If BR is a legitimate competitor to MR, then it is a positive, causal hypothesis about the cognitive mechanisms operating in an organism and as such requires empirical support and an evaluation of its epistemic virtues relative to that of other hypotheses. Even if BR were simpler than MR (and it is not clear that it is), there is no reason why the possession of that one particular virtue should trump all others. The MR hypothesis might be more complex, but it might also be more coherent, more general, more fruitful, etc. than BR. There is no a priori reason for valuing these virtues less than simplicity.

    Apologies for the long response and for not getting to all of the great points made by the commentators! I hope that we will have a chance to discuss them as the conversation continues.

    Thanks for participating!

  7. Your point, Jim Hamlyn, is well taken! I agree that mindreading is not the best term for the cognitive capacity under discussion and behaviour reading is not the best contrast (and the terms that we use are important, as they help frame the conceptual space). This also relates to Corina Logan’s point that what might be needed in order to move the debate forward is more empirical work on the cognitive and neural mechanisms involved in producing the behaviours that we tend to interpret as mindreading versus behaviour reading. This might give us a better idea of what we are talking about and help us avoid our own folk-psychological trappings. I completely agree and think that this is one of the most promising routes forward (and one currently pursued by Josef Perner and colleagues!).

  8. Hi everyone. Thanks for the paper Marta, I liked it a lot. And great discussion so far! I have two comments.

    First, at times it seems worth pulling apart a couple of similar complaints about the BR hypothesis. First, there is the general complaint that BR hypotheses are (or often tend to be) ad hoc. This could be a complaint about the hypothesis itself; for instance, it is often said that it simply consists of a list of rules, with no real unified mechanism. It could also be a complaint about the relationship between the hypotheses and the experiments; for instance, MR hypotheses have actually motivated the construction of the experiments themselves. I take Marta’s really novel contribution to be a similar, but different, complaint about the evidential relation between the experiment and the hypothesis. Marta employs all of these at different points (including the reply to commentaries), and it seems worth flagging the difference. This may be largely a rhetorical comment, but it matters. For instance, the fact that experiments were designed to test MR hypotheses, but not BR hypotheses, is reason to expect that they are severe tests of MR, but not BR, but it does not mean that they are. The commentaries point to some reasons to think they (often) are not. I worry Marta is, if not conflating these, taking the claim “designed to test X” to provide most of the motivation for the claim “is in fact a severe test of X.” It seems to me that you can only establish the latter with a careful discussion of individual experiments, which I’d like to see.

    The second point is about the degree of abstraction at which MR and BR hypotheses are pitched, and I think it supports the first. As we see in Lurz’s commentary, if we get more specific about the BR hypothesis, then a MR experiment can be a severe test of BR (“line of gaze reading” rather than “whatever-behavior-it-is reading”). Marta replies that the BR hypothesis in general doesn’t specify which of these specific hypotheses are included. But a similar complaint could be made about MR. The general hypothesis that an animal has the capacity to represent mental states is, itself, extremely flexible; it does not specify what mental states are attributed, how/when they attribute them, and what predictions they generate. We have intuitions about answers to questions like these, but as far as I can tell there is not really an MR model that is much better specified than the BR ‘model.’

    This leads me to think that the best route is to look at the general pattern of results, attempt to characterize the capacity, and then decide whether that capacity counts as mindreading or behavior reading. This is what Penn & Povinelli do, as Andrews notes above. More importantly, it leaves both MR and BR in a similar position with respect to the experimental results: rather than hoping for specific experiments to be the crucial experiment, proponents of each should take the experiments to be one step in characterizing the capacity (at least, for the time being). So attacking skeptics by arguing that they take all the results, ‘positive’ and ‘negative,’ from MR experiments to be evidence for their view is more an attack on a certain rhetorical strategy being attributed to skeptics, not an attack on the idea that the evidence, properly characterized, might support BR.

  9. Thanks for the comments, Mike Dacey! Those are really good points. Concerning the first one, I absolutely agree that establishing whether an experiment is a severe test for a particular hypothesis requires looking carefully at the particulars of the experiment. How was the experiment designed? What role did the hypothesis play in its design? Did the researchers adequately control for known experimental confounds? Was the right statistical analysis employed? Etc. You (and Andrews) rightly point out that there are things that psychologists could do that would reduce the severity of their experiments. However, generally, rules of good experimental practice police against such moves. Well-trained psychologists are aware of the things that make for bad tests and avoid them. Cherry-picking data, changing the hypothesis during or after the test, running too many statistical analyses, choosing to run a one-tailed test after you have seen the data: these are all things that experimental psychologists know should be avoided because they increase the risk of errors, such as false positives.

    The point that the BR hypothesis is ad hoc is not a separate one, I think. Ad hoc constructions of hypotheses increase the chances of false positives, especially if the only constraints on the construction of the hypothesis are the experimental results being assimilated. This is why a list of behavioural rules with no independent theoretical plausibility is so problematic. This is an extreme case, though, and I do not think it applies to the more specific line-of-gaze hypothesis or to Heyes’s submentalizing hypothesis. Your point that one must look at hypotheses on a case-by-case basis is an important one. A hypothesis can be constructed in an ad hoc manner, but still do other things that allow us to gather evidence in its favour, such as make a novel prediction, cohere with other theories, or even explain current data in a coherent manner. BR runs into problems when it lacks these things.

    Concerning your final point regarding the MR hypothesis, I have to disagree. The MR hypothesis makes very specific claims about how organisms will behave in particular situations. This is what has made the experimental approach to testing MR so fruitful. The level-1-visual-perspective-taking hypothesis, for example, predicts that subjects will prefer to reach through an opaque box when stealing food from a competitor, use visual gestures towards agents who can see them, but not towards those who cannot, etc. (see the Table at 0:56 in the video for more). If these predictions were not borne out, one could not easily revise the MR hypothesis in order to accommodate this. Of course, one could engage in such revisions, but the more one does, and the more far-fetched the revisions seem, the worse for MR. In practice, this has not happened a lot. Some early auxiliary assumptions have been revised (e.g., attempts to improve the ecological validity of studies), but not that many to my knowledge.

    The general BR hypothesis lacks such predictive power. More specific hypotheses like line-of-gaze might fare better, but it seems to me that BR advocates have not explored this aspect of their view very much. Instead, the focus has been on designing better mindreading experiments.

  10. Hi Marta,

    I wanted to respond to two points you make in your reply.

    You write: “This is in contrast to advocates of behaviour reading (BR) who do not conduct BR experiments, but rather rely on the results of MR experiments for evidential support.”

    But all existing MR experiments are simultaneously BR experiments. An experiment designed to test the MR hypothesis that an animal uses behavioral/environmental cue S to impute a mental state M for the purpose of predicting behavior B also tests the BR hypothesis that the animal uses cue S to predict behavior B. Hare et al.’s 2000 experiment, for example, is simultaneously a test of the MR (‘seeing’) hypothesis and the BR (‘line of gaze’) hypothesis. Hare et al.’s experiment not only tested whether subordinate chimps could predict dominant behavior on the basis of what food the dominant could and couldn’t see, it also tested whether subordinate chimps could predict dominant behavior on the basis of what food the dominant did or didn’t have a line of gaze to. Both hypotheses were simultaneously tested and confirmed by the data. Hare and colleagues may not have intended to test the line of gaze hypothesis with their experiment, but their experiment tested it nonetheless. So the positive results of Hare et al.’s 2000 experiment are just as much evidence for the BR hypothesis as for the MR hypothesis. That is why the data provide at best equivocal evidence for the seeing hypothesis.

    Furthermore, had the subordinate chimps showed no significant preference for the food behind the barrier over the food out in the open, their performance would have disconfirmed both hypotheses. Thus, Hare et al.’s experiment is a severe test for both the seeing hypothesis and the line of gaze hypothesis.

    You write: “In the past, BR theorists have taken both the positive and negative results of MR experiments as evidence for their hypothesis. However, doing this means that these experiments have no ability to detect whether BR is false – a negative BR result in this case is simply not an option.”

    Not true. Quite a few BR hypotheses have been tested by MR experiments and shown to be false. For example, in Hare et al. 2000, five different BR hypotheses (e.g., the dominant’s-accessibility-to-food hypothesis; the out-of-sight-out-of-mind hypothesis; the dominant’s-behavior hypothesis; the intimidation hypothesis; and the out-of-reach hypothesis) were tested and found to be false. In addition, in Hare, Call & Tomasello 2001 and Kaminski, Call & Tomasello 2008, another BR hypothesis – the evil eye hypothesis – was tested and found to be false. And the peripheral feeding hypothesis, yet another BR hypothesis, was tested and found to be false in Hare et al. 2001. But, of course, not all the BR hypotheses that these MR experiments test are shown to be false. The line of gaze hypothesis, for example, was tested by Hare et al.’s 2000 experiment and was shown to be consistent with the data.

    Here is how I see the situation. The BR hypothesis is a general hypothesis that says that animals that predict the behavior of other animals do so by using behaviorally relevant observable cues without interpreting those cues as signs of mental states. There are a number of possibilities for what sorts of observable cues animals might use to make such predictions. Since one cannot say a priori which cues animals use, experiments are needed. A few particular BR hypotheses have been tested for some animals (mostly, chimps) and falsified (some of these BR hypotheses I mention above). Other particular BR hypotheses have been tested and confirmed (e.g., the face-visible rule hypothesis in Povinelli & Eddy 1996; the bivariate-hierarchical hypothesis in Kaminski, Call & Tomasello 2004; the experimenter-manipulating-food hypothesis in Call, Agnetta & Tomasello 2000). And still other BR hypotheses (e.g., the line of gaze hypothesis) have been tested and confirmed but make the same predictions as some MR hypotheses (e.g., the seeing hypothesis) that have been tested and confirmed. Furthermore, there is no experiment that I know of that has produced data that are more plausibly predicted and explained by a MR hypothesis than by a BR hypothesis. So just looking at the data, we cannot say whether animals are MR or BR. What is needed are better, more sensitive tests – tests that have a chance of producing data that are more plausibly predicted and explained by a MR hypothesis than by a BR hypothesis. All BR theorists (e.g., Povinelli, Vonk, Perner, Heyes, Krachun, and myself) argue that such tests are possible (e.g., the ‘goggles test’ put forward by Povinelli, Vonk, and Heyes, and the appearance-reality mindreading tests put forward by Krachun and myself). However, to date, these tests have not been run.

    1. Thank you for your response, Robert! I think it gets to the heart of the issue that I’m worried about.

      You write that both the MR and BR hypotheses are tested by experiments like Hare et al. 2000, even though the experiment was designed to test the former, not the latter. This is precisely what I would like to dispute. The independent and dependent variables of experiments such as these were chosen in order to test the MR hypothesis. One cannot, I claim, come in and swap that hypothesis with a rule that states that the independent and dependent variables will vary in precisely the way observed. If one has reasons for thinking that these two variables would have had the observed relationship independently of the MR hypothesis (as per the video example, one discovers that chimpanzees are scared of transparent boxes near competitors), then, yes, this is an experimental confound that needs to be eliminated before we can claim to have evidence for MR. You rightly point out that the elimination of such BR confounds has been an important part of the MR research program.

      However, psychologists cannot and should not be required to eliminate every conceivable BR confound. They cannot because one could always posit that subjects are implementing a rule that links the independent and dependent variables being tested (including in the “goggles test,” as I have argued elsewhere). They should not because that is an unrealistic standard, which we do not hold other research programs to. As I write in the paper, one could use the alternative-hypothesis objection as a sceptical foil aimed at undermining a research program. We need to guard against this. It is my understanding that this strategy has been employed by tobacco companies to undermine the claim that smoking causes lung cancer and by climate-change sceptics to undermine claims about the anthropogenic origins of global warming. If these cases are indeed examples of illegitimate sceptical foils, then we need to be clear about what distinguishes the BR alternative-hypothesis objection from these cases.

      Concerning whether the negative results of MR experiments count against the BR hypothesis: Insofar as the line-of-gaze and MR hypotheses make the same predictions about the S-B relationships we should find in a given subject group, I agree that a negative result would count against both hypotheses. But I’m not entirely convinced that the line-of-gaze hypothesis predicts these relationships. If one had no access to the results of MR experiments or the MR hypothesis, would one still be able to make these predictions? There are many, many S-B relationships that an organism such as a chimpanzee might (or might not) have knowledge about. Which ones should we expect them to know? If the BR theorist’s answer is “those that they do know,” then I would say that the BR hypothesis lacks the content to make concrete predictions. The MR hypothesis, on the other hand (and this goes back to Mike’s point), does make such predictions. The immense predictive power of mental-state reasoning is puzzling, no doubt. The fact that attributing this reasoning ability to chimpanzees and scrub jays has survived experimental scrutiny is even more astounding. Most psychologists were sceptics to begin with. However, the experimental evidence has since convinced many that these organisms are engaging in some form of mindreading and I think that this is for good reason. We could of course get better evidence with more tests and by implementing more controls, but to say that the alternative-hypothesis objection undermines all evidence for animal mindreading is, I think, unwarranted.

  11. As Irina points out, a lot of the appeal of scepticism seems to come from a sense that non-mindreading hypotheses are ‘simpler’ and therefore preferable. But it’s really not clear to me that this is always true, at least in chimps and other fairly close relatives of humans. Even if mindreading is a more complex faculty than behaviour-reading, given that we seem to know that humans mindread (pace Corina’s doubts), it might be simpler to hypothesise continuity between humans and nonhumans in how we predict behaviour, than to hypothesise different mechanisms for the same purpose.

    By analogy, suppose we encounter a new species of mammal, and we’re wondering what sort of heart it has. Prior to looking directly, surely the hypothesis that it has a 4-chambered heart like all its relatives is ‘simpler’ than the hypothesis that it, unusually, has a simpler design of heart. Of course the mindreading case is harder, because many species have 4-chambered hearts and only one, fairly unusual, species is conclusively known to exhibit mindreading. But the point is that deciding which hypothesis is simpler can be tricky.

    Here’s another way that mindreading might be a simpler hypothesis: if mindreading works by simulation, then it seems like it might actually be a fairly simple mechanism, since it can represent a whole bunch of mental states using the same machinery (viz. some mechanisms to initiate and quarantine a simulation, and the existing capacity to have various mental states). Compared to this, the behaviour-reading hypothesis seems more complicated, since each rule about behaviour has to be stored separately.

    1. Hi Luke,
      Interesting hypothesis about the mechanism behind mindreading. I think this would be a fruitful area to develop and start testing, just like people do with cognitive maps, to get a handle on what we are actually talking about.

      Irina Mikhalevich has an excellent paper on why preferring parsimony is unjustified: https://www.academia.edu/6178673/A_critique_of_the_principle_of_cognitive_simplicity_in_comparative_cognition

      Out of curiosity (and a desire to check out your papers), who are you?

      1. Just to give a small plug for an upcoming event at the Brains Blog, next month we will host a Mind & Language symposium on Hayley Clatterbuck’s paper “Chimpanzee Mindreading and the Value of Parsimonious Mental Models” where she and the commentators will confront the parsimony arguments directly.
