Christopher Chabris

For Better or Worse, the A/B Effect Is Real

2019-11-07T07:05:00.001-08:00

This post is by Michelle N. Meyer, Patrick R. Heck, Stephen M. Anderson, William Cai, Duncan J. Watts and Christopher F. Chabris

1. Introduction

Earlier this year we published a set of studies in PNAS (open access article; Supplemental Information; OSF preregistrations, data, and code) on what we called the “A/B Effect,” which we defined as occurring when people “judge a randomized experiment comparing two unobjectionable policies or treatments (A and B), neither of which was known to be superior, as less appropriate than simply implementing either A or B for everyone” (Meyer et al., 2019, p. 10723). More precisely, we found that when separate groups of participants rated the appropriateness of policy A (for example, giving teachers a yearly bonus), policy B (e.g., giving teachers more vacation days), or an A/B test with the explicit goal of learning which policy leads to the best outcome (and then making that the policy for everyone going forward), the A/B test received lower mean appropriateness ratings, and more participants explicitly rated it as inappropriate, than either A or B. We observed this effect to varying degrees when people evaluated vignettes in several domains, including health care, humanitarian aid, education, tech, and “corporate” experimentation. We concluded that policymakers might face more backlash from running A/B tests to determine the effects of their proposed policies than simply using their intuition to pick and impose a policy on everyone, a possibility that we, as fans of evidence-based decision-making, find lamentable.

Today, Berkeley Dietvorst, Robert Mislavsky, and Uri Simonsohn have posted on Simonsohn's Data Colada blog a critique of our paper. Their post is an expanded version of a Letter by Mislavsky, Dietvorst, and Simonsohn (henceforth MDS) published yesterday in PNAS (our Reply will appear in PNAS soon) (UPDATE: our Reply was published in PNAS on 12 November 2019). MDS claim that “experiment aversion” (a term they use to refer to the A/B Effect we described, but to which they attach a narrower meaning, a point we elaborate on below) can be explained away as the result of a statistical artifact, and is therefore not actually occurring in our studies. We lay out this argument in Section 2 below, but it boils down to a claim (for which they report evidence in their own paper that is in press at Marketing Science) that people’s rating of an experiment is never lower than their rating of the worst policy that that experiment contains. Because our PNAS studies were all between-subjects, where participants rated A or B or the A/B test, but not all three, we can only compare one group’s ratings of the A/B test to the ratings of A and B made by the other two groups of participants.

In designing our studies, we made a deliberate decision to conduct between-subjects experiments. In the real world, people are rarely given the opportunity to evaluate option A, option B, and a test comparing these. When a policy is implemented, it is rare that rejected alternative policies (or policies the decision-maker never thought of) are announced alongside it. Similarly, when an A/B test comparing two policies is announced, it is virtually never pointed out that the decision-maker could have chosen to impose A or B on everyone. The world, in other words, is a between-subjects experiment, where people rarely realize or contemplate the fact that they are, in effect, subject to policies and decisions that make them participants in any number of uncontrolled experiments. To choose anything other than a between-subjects design, therefore, would have sacrificed the external validity of our studies.

We did think, however, that alerting raters of A/B tests to the fact that agents often have the power to simply impose either untested policy on everyone might improve perceptions of experiments and/or that alerting raters of policies to foregone alternatives might make people more skeptical of policies. Either (or both) would have the effect of reducing the A/B Effect, and mitigating people’s objection to experimentation is one of the ultimate goals of our research program. To that end, we had long planned to run within-subjects experiments, some of which we have now conducted and reported in a preprint and describe below. In all three of the scenarios from Meyer et al. (2019) that we have so far tested in a fully within-subjects design, including one from the Study 3 scenarios that MDS prefer, we find robust evidence not only of the A/B Effect, as we defined it (AB < mean(A,B)), but also of increasingly conservative forms of “experiment aversion,” including MDS’s definition (AB < min(A,B)) and an even more conservative definition which requires individuals to rate the A/B test as inappropriate while also not objecting to either policy.

In this essay, we analyze MDS’s arguments and explain why there is strong evidence for both the A/B Effect and for MDS’s more narrow “experiment aversion,” both in our new within-subjects experiments and in the studies we reported in PNAS (evidence MDS ignore in their blog post and PNAS Letter). We end by addressing MDS’s objections to our Checklist and Best Drug scenarios and making some observations on their research and reporting practices.

2. The mathematical fact behind MDS’s critique

The core of MDS’s argument is a simple mathematical fact: given pairs of samples from two distributions with different means, the mean of the minimums must be less than or equal to the minimum of the means (in math, mean(min(A_i,B_i)) <= min(mean(A_i),mean(B_i)), where A_i and B_i are the ith respondent’s ratings of A and B respectively). In the context of our studies, if each participant is asked to rate their approval for two different policies, taking the lower rating of each person’s pair of ratings, and then taking the average of these lower ratings, is guaranteed to be less than or equal to the result of first averaging over all participants for each policy and then taking the minimum of these means.
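The inequality is easy to check numerically. The following Python sketch uses made-up paired ratings; the 1-to-5 coding, the data, and the function names are illustrative assumptions and are not taken from our studies or from MDS's.

```python
from statistics import mean

def mean_of_mins(a, b):
    """Average of each respondent's lower rating: mean(min(A_i, B_i))."""
    return mean(min(x, y) for x, y in zip(a, b))

def min_of_means(a, b):
    """Lower of the two policy averages: min(mean(A_i), mean(B_i))."""
    return min(mean(a), mean(b))

# Hypothetical paired ratings, coded 1 (very inappropriate) to 5 (very appropriate).
# Respondent i gave ratings_a[i] to Policy A and ratings_b[i] to Policy B.
ratings_a = [5, 2, 4, 1, 5, 3]
ratings_b = [2, 5, 3, 5, 1, 4]

print("mean of minimums:", mean_of_mins(ratings_a, ratings_b))
print("minimum of means:", min_of_means(ratings_a, ratings_b))
```

Because min(A_i, B_i) is no greater than A_i and no greater than B_i for every respondent, averaging preserves both inequalities, so the first number can never exceed the second. In this invented example the two policies have identical means, yet the mean of the minimums is much lower, because respondents disagree about which policy is worse.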

We have no argument with this mathematical observation. In theory, it could account for what MDS call “experiment aversion.” That is, when people rate experiments comparing two policies, A and B, as less appropriate than implementing either Policy A or Policy B alone for everyone, it could be that participants in the experimental condition see both policies and simply give the experiment the lower of their ratings of the two policies. On this account, people’s rating of the A/B test would equal their rating of the policy (A or B) that they, personally, like the least. In essence this is what MDS claim to have found in their own in-press paper.

Finally, we acknowledge that in most of our scenarios we did not directly test for this possibility. As noted earlier, we adopted a between-subjects design; thus, in most of our scenarios, only the participants in the A/B condition saw both policies. We therefore compared the mean of the A/B condition with the minimum (and mean) of the A and B means, but were not able to compare the A/B condition to the mean of the minimums.

In other words, we agree that MDS have proposed a hypothesis that could explain why participants often rate A/B tests less favorably than universally implementing either policy A or B.

3. MDS draw inappropriate conclusions from this fact

If all they had done were to propose an additional possible explanation for the A/B Effect findings in our Study 3, we would not object. However, their critiques go further, claiming via simulations that the mathematical fact they rely on completely accounts for all our findings, rendering our claim about the existence of an A/B Effect invalid.

Their empirical exercise combines a small amount of new data (99 MTurkers judging the appropriateness of the A and B policies from all seven experiments in Study 3 of our PNAS paper) with some of our data (our participants’ evaluations of the seven A/B test and policy conditions from the same study) to simulate what would have happened if our participants had evaluated not just the A/B test, but also the A and B policies. In effect, they are simulating the outcome of within-subjects experiments and then claiming to have found no evidence that such experiments would find the A/B test to be judged worse than the worst policy. Their simulation assumes, however, that they can infer how our participants who saw only the A/B test would have rated the A and B policies, had they also seen them. This assumption is unwarranted, as our within-subjects experiments suggest (see Section 4, below); A/B tests are sometimes rated differently when the raters are explicitly told that A and B could have been implemented unilaterally, than when they are only told about the A/B test.

There are other problems with MDS’s second, more expansive claim. First, even if it were true that lower ratings for experiments are always fully explained by individuals’ least preferred policy, it simply does not follow that people never object to experiments comparing unobjectionable policies. (This claim is made in the subtitle of MDS’s in-press Marketing Science paper, which is “People don’t dislike a corporate experiment more than they dislike its worst condition.”) This first problem amounts to a difference in definitions. MDS present their finding (that they fixed a “problem with the statistical analysis” in our paper) as reflecting a mathematical fact. However, their conclusion depends on a particular definition of experiment aversion, in which an individual participant presented with all three possibilities (A, B, and the A/B test) ranks the A/B test the lowest. Although this definition makes sense in light of MDS's objectives, as explained above, it does not make much sense in the real world.

Put another way, if any part of participants’ objections to an experiment derives from the fact that they, unlike their peers who only see A or B, have been made aware that alternatives exist, that is a mechanism of the psychological phenomenon we reported. Rather than being a “confound,” we view the nature and extent of this mechanism as an important empirical question. Presumably MDS feel differently, but that is a difference of opinion regarding which phenomena are interesting to study, and not at all a “problem with the statistical analysis” on our part that they “corrected.”

Second, it is not true—both in the studies that we have conducted since (see Section 4) and in the studies we have already published (see Section 5)—that people object to experiments no more than they object to their least-preferred of the policies the experiment contains. To the contrary, people often rate experiments as significantly worse than the policy they personally regard as worst (and explain their ratings by objecting explicitly to the experiment rather than to the treatments it compares).

4. Within-subjects evidence for the A/B Effect and “experiment aversion”

MDS propose that a true test of experiment aversion is one in which a participant’s rating of an A/B test can be compared directly to his or her rating of the least-preferred option (A or B). We have now run several of our own scenarios using a fully within-subjects design that allows this test to be run. The results of our within-subjects studies clearly support the existence of not only our A/B Effect but also MDS’s “experiment aversion.” Below, we explain some context and the timing of these experiments, followed by a short summary of the results. Interested readers can refer to our preprint.

We found evidence for an A/B Effect and for MDS’s definition of experiment aversion in three out of three high-powered, preregistered experiments where we adapted scenarios from Meyer et al. (2019) to a within-subjects design. We began these experiments before learning about MDS’s critique. [1] The scenarios we ran included Hospital Safety Checklist (where a hospital director chooses between placing safety reminder checklists on posters in procedure rooms or on doctors’ ID badges), Best Drug: Walk-In Clinic (where some doctors prescribe Drug A and others prescribe Drug B to all of their patients, effectively randomizing which drug patients receive when they walk into the clinic), and Consumer Genetic Testing (where a genetic testing company chooses to either return medically actionable genetic test results or all genetic health results, including those which customers can do nothing about). We chose these scenarios because they varied in domain (health, clinical medicine, or corporate), the decision-making agent (a hospital director, a doctor, or a CEO), and the size of the A/B Effect observed in the between-subjects versions.

The only difference between these experiments and those reported in our original paper was the design: here, participants rated all three options (Policy A, Policy B, and an A/B test, presented in counterbalanced order) on the same page. This design gives participants the option to rate their least-preferred policy (of A or B) equally to the A/B test, a pattern that should emerge if MDS’s hypothesis is correct. This is not what happened. We refer readers to our preprint for more details, but the results are described in brief below.

First, we can simply look at how many participants objected to each condition. Figure 1 shows the percentages of participants who objected to A, B, or the A/B test (by rating them as somewhat or very inappropriate). Whereas only ~10% of participants objected to each policy option across the three scenarios, more than one-third objected to the experiment—even when these participants had also read about the two unilateral policy options. This result also refutes the additivity claim that MDS make in their critique via a dessert-allergy analogy: in each scenario, the striped bar is substantially taller than the sum of the black bar and white bar. Participants in these studies felt it was more appropriate to assign A or B to everyone than to run an A/B test with the explicit purpose of learning which policy is most effective, even when they had complete information about all three options.

Figure 1 (from Heck et al., 2019)

Second, this design allowed us to test MDS’s proposed hypothesis directly, so we did. Following the exact summary of this test in their blog post, “Policies A and B were evaluated by the same people, and we compared the least acceptable policy for each participant to the acceptability of the experiment.” In other words, we took only the lower-rated policy between A and B for each participant, averaged over this measure, and compared it with the average rating of the experiment (computed over all participants). Figure 2 shows participants’ average ratings of Policy A (A symbols), Policy B (B symbols), the A/B test (X symbols), each participant’s average policy rating (black circles), and the measure of each participant’s least-preferred policy (black diamonds). Participants viewed the A/B test conditions as less appropriate than they viewed A, B, mean(A,B), and their least-preferred between the two (min(A,B)). By all measures and accounts, these results refute the premise that people don’t object to experiments.

Figure 2 (from Heck et al., 2019)
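To make the comparison described above concrete, here is a minimal Python sketch of the quantities involved. The rating triples are invented for illustration and the 1-to-5 coding is an assumption; this is not our analysis code, which is available with the preprint materials, and a real analysis would use a paired significance test rather than a bare comparison of means.

```python
from statistics import mean

# Hypothetical data: one (Policy A, Policy B, A/B test) rating triple per
# participant, coded 1 (very inappropriate) to 5 (very appropriate).
ratings = [(4, 4, 2), (5, 3, 3), (4, 5, 2), (3, 4, 4), (5, 5, 1)]

def summarize(rows):
    """Return the group-level quantities compared in the text."""
    ab = mean(r[2] for r in rows)
    # Each participant's average policy rating, then averaged over participants.
    mean_policy = mean((r[0] + r[1]) / 2 for r in rows)
    # Each participant's least-preferred policy rating, then averaged.
    least_preferred = mean(min(r[0], r[1]) for r in rows)
    return {
        "ab": ab,
        "mean_policy": mean_policy,
        "least_preferred": least_preferred,
        "ab_effect": ab < mean_policy,              # AB < mean(A,B)
        "experiment_aversion": ab < least_preferred,  # AB < min(A,B)
    }

summary = summarize(ratings)
print(summary)
```

The second flag is the stricter one: since each participant's minimum is never above his or her average policy rating, any data set that satisfies AB < min(A,B) at the group level also satisfies AB < mean(A,B).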

Finally, a fully within-subjects design allows us to conduct an analysis that is even more restrictive than what MDS propose. We argue that it would be face-valid evidence for the A/B Effect if a participant explicitly objects to an A/B test (by rating it somewhat inappropriate or very inappropriate) while also choosing not to object to either Policy A or Policy B (by rating both as very appropriate, somewhat appropriate, or neither appropriate nor inappropriate). In each experiment, approximately 27% of participants did just this. In short, more than one-quarter of participants demonstrated the strongest possible version of the A/B Effect by objecting to an A/B test while simultaneously choosing not to object to unilateral implementation of the untested policies it is designed to evaluate. [2]
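This most restrictive criterion is a simple per-participant classification. The sketch below shows one way to compute it in Python, assuming ratings coded 1 (very inappropriate) to 5 (very appropriate), so that a rating of 2 or lower counts as an objection. The coding, the data, and the function name are hypothetical.

```python
def strict_objector_share(rows, objection_threshold=2):
    """Share of participants who object to the A/B test but to neither policy.

    Each row is a (Policy A, Policy B, A/B test) rating triple. A rating at or
    below objection_threshold (somewhat or very inappropriate) is an objection.
    """
    def objects(rating):
        return rating <= objection_threshold

    count = sum(
        1
        for a, b, ab in rows
        if objects(ab) and not objects(a) and not objects(b)
    )
    return count / len(rows)

# Hypothetical ratings for six participants.
sample = [(4, 4, 2), (5, 3, 3), (4, 5, 2), (2, 4, 1), (5, 5, 1), (3, 4, 4)]
print(strict_objector_share(sample))
```

Note that the fourth invented participant objects to the A/B test but also objects to Policy A, and so is excluded. That exclusion is what makes this criterion more conservative than comparing the A/B rating with the participant's least-preferred policy.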

5. Evidence from our PNAS studies that MDS ignore

Aside from our definitive within-subjects studies, there was already significant evidence in our PNAS studies to cast doubt on MDS’s claim that the A/B Effect wholly reduces to aversion to individuals’ least-favored policies.

First, in some of our scenarios, the percentage of participants who object to the A/B test exceeds the percentage who object to policy A plus the percentage who object to policy B. Even assuming zero overlap between those who dislike policy A and those who dislike policy B (a fairly implausible assumption), then, MDS’s hypothesis cannot fully account for the experiment aversion we observe in those scenarios.

Second, as we note in our PNAS article, anticipating this concern about joint evaluation is what led us to run Studies 4 and 5 in Meyer et al. (2019), which provide good evidence that participants were objecting to experimentation and not merely to a disliked policy. In these studies, the A/B tests compared generic treatments—“Drug A” and “Drug B.” Unlike MDS’s hypothetical A/B test in which 30% of people (those with lactose intolerance) hate the dairy treatment, 30% of people (those with peanut allergies) hate the peanut treatment, and therefore 60% of people hate the A/B test, there is no rational basis for one group of raters to hate “Drug A” and another group to hate “Drug B,” yielding an A/B Effect that is driven entirely by individual raters’ least-preferred policies. And so MDS’s lactose-peanut hypothesis cannot explain the fact that, in our original and replication MTurk experiments, about 10% of people in each policy condition objected to the agent’s decision to prescribe Drug A (or B) for everyone, but about 35% of people in the A/B conditions objected.
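The arithmetic behind this argument is a simple upper bound: if objections to an A/B test came only from people who dislike one of its arms, the share objecting to the test could be at most the share objecting to A plus the share objecting to B. A minimal Python sketch, using the approximate percentages quoted above (the function names are ours, for illustration only):

```python
def max_share_explained(pct_object_a, pct_object_b):
    """Largest share of A/B objectors that policy dislike alone could produce.

    Assumes zero overlap between those who object to A and those who object
    to B, which is the most generous case for the policy-dislike account.
    """
    return min(100, pct_object_a + pct_object_b)

def exceeds_additive_bound(pct_object_a, pct_object_b, pct_object_ab):
    """True if A/B objections exceed what policy dislike alone could explain."""
    return pct_object_ab > max_share_explained(pct_object_a, pct_object_b)

# MDS's dessert analogy: 30% + 30% can fully account for 60%.
print(exceeds_additive_bound(30, 30, 60))

# Approximate Best Drug rates: about 10% object to each policy, about 35% to the test.
print(exceeds_additive_bound(10, 10, 35))
```

With the rounded figures above, policy dislike could account for at most 20 percentage points of the roughly 35% who objected to the A/B test, leaving about 15 points unexplained even under the zero-overlap assumption.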

Finally, one obvious way to determine why people disapprove of experiments is to ask them. For this reason, in all of our experiments we asked our participants to tell us why they gave the ratings they gave. Of the 16 experiments reported in PNAS, we coded free responses from 4 (covering the Checklist and Best Drug: Walk-In scenarios), using two independent raters and a codebook of 28 codes. As we noted in the article (p. 10725), a fairly small minority of A/B test participants remarked that one policy was superior to the other or objected to unequal treatment of people, per se, and excluding those participants still yielded a large A/B Effect:

14% of participants in the A/B conditions of study 1 and its first replication [Checklist] commented that one policy was preferable or that the A/B test treated people unequally. Notably, excluding these participants still yields a substantial A/B effect in study 1, t(388) = 11.3, p < 0.001, d = 1.14, and its replication (study 2a), t(354) = 7.30, p < 0.001, d = 0.78.

Although in their letter to PNAS MDS claim to have invalidated our results based on the so-called “minimum mean paradox” described above, they then go on to raise other unrelated objections to the studies reported in our paper for which (as we show in our forthcoming reply in PNAS) their analysis fails to deliver their desired result. We turn to those objections now.

6. Responses to MDS's criticisms of our health domain designs

MDS dismiss the Checklist scenario as “confounded” because the A/B condition differs from the policy conditions not only in involving a randomized experiment, but also in a second way: Only in the A/B condition, they say, are raters confronted with a decision to forgo a potentially life-saving intervention for no disclosed reason, whereas in the policy conditions, “there is only one option available.” But this is not true. When, in the A condition, the hospital director decides to implement a Badge policy for everyone, he necessarily forgoes (without explanation) a Poster policy (and any number of alternative or additive policies), any one of which might have greater life-saving power than Badge, either as a substitute for Badge or as an addition to it. The fact that policy implementers rarely disclose alternative or additional policies they have foregone does not mean those policies are not “available” or that the policy implementer’s decision “caused” deaths any less than did the director who ran an A/B test.

We agree entirely that our A/B test descriptions necessarily make extremely salient the fact that some people will be denied Badges and others will be denied Posters, whereas people rarely take away from an announcement of a universal Badge policy the consequence that everyone is being deprived of Posters. But that is the nature of how A/B tests and universal policies are announced, respectively, in the real world, and that real-world setting is precisely what we attempted to capture in our vignettes (see our previous comments about external validity). To the extent that A/B tests are perceived to deprive people of important interventions while decisions to choose between those same interventions in the form of a universal policy implementation are not so perceived, that is a mechanism of the A/B Effect, not a “confound” to be avoided. [3]

MDS suggest that this “confound” could be avoided by modifying the scenario so that the agent in all three conditions considers two options and opts for only one, for no disclosed reason. Of course, that is exactly what we did in Studies 4, 5, and 6b. In Study 4, participants in all conditions (A, B, and A/B) were told that multiple FDA-approved blood pressure drugs exist, but Dr. Jones decides to give all of his patients “Drug A” (or “Drug B,” or conduct an A/B test of Drugs A and B to see which helps his patients most and offer them that one going forward). Similarly, in Study 5 (replicated in a health provider sample in Study 6b), all participants in all conditions are told that some doctors in a walk-in clinic prescribe “Drug A” to all of their patients while others prescribe “Drug B” to all of their patients. Dr. Jones then decides, for no disclosed reason, to give all his hypertensive patients Drug A (or to give them all Drug B, or to conduct an A/B test).

However, MDS have two reasons for dismissing the results of these scenarios. First, they say, only in the A/B condition is Dr. Jones uncertain about how to treat his patients. This is partially true—and entirely irrelevant. It is true in the sense that Dr. Jones, as an individual, demonstrates uncertainty only in the A/B condition, as do all agents who embark upon an A/B test (although we prefer to describe them as having epistemic humility). But all of the walk-in clinic doctors collectively demonstrate uncertainty about the best treatment for their patients. Participants rating all conditions are told that some doctors prescribe Drug A to all of their patients, while others prescribe Drug B to all of their patients—and these patients walk in off the street and are heretofore unknown to the doctors, so it is not the case that either patients or doctors are selecting one another in ways that would make a particular drug make more sense. Hence, this is a case, as we noted in PNAS, of unjustified variation in medical practices: the very common situation in which providers choose and assign to all their patients one treatment according to accidents of how they happened to have been trained, or which conferences they happen to have attended, and so on. The fact that experts and other agents are punished for having the epistemic humility to conduct an A/B test while agents who use their intuition to pick one policy for everyone are rewarded for being (overly) confident is a fascinating mechanism of the A/B Effect (we call it a proxy version of the illusion of knowledge in the PNAS paper), not a reason to dismiss evidence of the effect.

Second, MDS say that in the Best Drug scenarios, only in the A/B condition is Dr. Jones “doing something potentially illegal (running a medical trial without oversight).” First, although all of our scenarios were silent about things like consent, notice, and IRB oversight, some of our participants assumed (not unreasonably) that the agents in our vignettes had no intention of pursuing any of these things (we plan to conduct follow-up experiments in which we experimentally manipulate these aspects of the vignettes to isolate the extent to which they contribute to the effect).

As for whether Dr. Jones is doing something illegal, although pre-market drug (and device) trials are indeed heavily regulated by the FDA, this scenario involved drugs already approved by the FDA. In the U.S. (all our participants were U.S. residents), the primary source of regulation of Dr. Jones’s A/B test would be the federal regulations governing human subjects research (known as the Common Rule). The Common Rule only applies by federal law to human subjects research funded by one of several federal departments or agencies. Since Dr. Jones “thinks of two different ways to provide good treatment to his patients, so he decides to run an experiment,” it seems unlikely that he has an in-hand NIH grant to support this activity. Most research institutions do apply the Common Rule as a matter of institutional policy to all human subjects research conducted there, but failure to seek IRB approval wouldn’t constitute an “illegal drug trial.” Moreover, the Common Rule only applies to activities “designed to develop or contribute to generalizable knowledge.” The Common Rule unhelpfully does not define this phrase, and IRBs can and do arrive at different interpretations. An IRB could find that all of the vignettes we constructed constituted (unregulated) quality improvement (QI) activities (where neither IRB approval nor consent is required) rather than (regulated) human subjects research. Randomization does not automatically make an activity human subjects research, and health systems can and do conduct “quality improvement RCTs” without IRB review or consent (e.g., Horwitz, Kuznetsova & Jones, 2019). Even when an IRB determines that an activity constitutes research, the Common Rule permits waivers for minimal risk research that would not otherwise be “practicable,” another undefined term that IRBs interpret differently, but one that can include scenarios where notifying patients that they are in a trial would invalidate the results.

MDS acknowledge that they do not know whether Dr. Jones’s A/B test was illegal or not, and say that it is a problem that our participants probably don’t know, either. But to the extent that misconceptions about when things like consent and oversight are and are not legally (or ethically) required make randomized evaluation harder, then that is one more mechanism of the effect we are studying.

7. Observations on MDS’s presentation of evidence

Finally, we remark on some curiosities in the way MDS present their case against our PNAS paper. First, they focus exclusively on the seven scenarios we tested in our Study 3, dismissing the ones from our Studies 1, 2, 4, 5, and 6 as irrelevant for idiosyncratic reasons. This means they are dismissing the totality of our results purely on the basis of an empirical exercise they conduct on less than half of our experiments—coincidentally, the ones in which we found the smallest effects (including one we reported as a null effect). In their PNAS Letter, they justify this choice by claiming that our Study 3 is the most similar to the studies they conducted for their in-press Marketing Science paper. But mere similarity to one’s own work is not a scientific rationale for selecting what part of someone else’s work can stand in for the whole. As we show in our forthcoming PNAS Reply, replacing just 3 of the 7 scenarios MDS chose to present with 3 of the ones they left out renders the evidence for their own form of “experiment aversion” significant in our original PNAS data.

Second, MDS went to the trouble of recruiting nearly 100 MTurkers to provide evaluations of the 14 policies from our Study 3, but they didn’t take the easy extra step of having the same participants evaluate the A/B tests as well. That is, they stopped just short of carrying out the within-subjects experiment that would have definitively tested their claim.

Third, as far as we can tell, MDS did not preregister the hypotheses or the data-collection and analysis plans for their empirical exercise. (In our PNAS paper, except for Study 1, every pilot, main study, and replication we reported was pre-registered, and we reported every single study we had done on this topic.) This is especially odd because Uri Simonsohn is a proprietor of AsPredicted, a popular website and service for pre-registering studies.

Fourth, in the main text of their Data Colada blog post, MDS introduce their discussion of our Checklist and Best Drug scenarios (Studies 1, 2, 4, 5, and 6 from our PNAS paper) as follows: "Next, the discussion that did not fit in the PNAS letter: the concerns we had with the designs of the 2 scenarios we did not statistically re-analyze" (emphasis added). However, we recently noticed that point 4 in footnote 3 of their post contradicts this statement. That footnote implies that MDS carried out their empirical exercise using nine of our scenarios, not just the seven they report in their PNAS Letter and the main body of their post, and found data supporting our hypothesis rather than theirs. As shown below, that footnote states "... in what we refer to as the 8th and 9th scenarios in the PNAS paper, the only two where in our re-analyses, experiments were objected to more than to their worst individual policy" (emphasis added).

There is no other mention of those other “re-analyses” in their PNAS Letter, blog post, or corresponding OSF archive, so we are confused.

8. Conclusion

We regret that our dispute with MDS has resulted in adversarial blog posts and PNAS Letters. We have been fans of much of Simonsohn’s previous work and of his blog. We found MDS’s in-press paper intriguing, and we stated publicly that we looked forward to doing studies that might explain why they reached different conclusions from us.

After MDS’s Letter was sent to PNAS, and after we submitted to PNAS our requested Reply, we discussed with Simonsohn some alternatives to this public argument, such as writing a joint commentary, or editing each other’s letters to resolve as many disagreements as possible before publication, or even starting an adversarial collaboration (e.g., Mellers, Hertwig, & Kahneman, 2001) to collectively gather new data that might sort things out. We remain open to working or at least having constructive discussions in the future with MDS and anyone interested in this topic.

Notes

[1] PNAS first contacted us on August 20, 2019, to tell us that MDS had submitted a Letter and to request our response. Our first within-subjects experiment was pre-registered on July 29 (materials testing preregistered on July 26). In that experiment, we showed and asked participants to rate all three conditions from the Meyer et al. Checklist scenario—but one at a time (in randomized order). After sequentially rating each of the director’s three choices, we invited them to revise their ratings after they were exposed to all three options. The results of this “Checklist—sequential” experiment support the A/B Effect and experiment aversion more narrowly, but we plan to report them in the Supplemental Information that will eventually accompany the preprint. Next, on August 12, 2019, we conducted a within-subjects experiment of Checklist in which participants were simultaneously shown and asked to rate all three decisions, presented in randomized order (materials testing preregistered on August 8). These results are reported in our preprint. On September 2, we preregistered a “simultaneous” within-subjects test of the Best Drug: Walk-In scenario from Meyer et al. (materials testing preregistered on September 1), also reported in the preprint. Finally, on October 15, we preregistered “simultaneous” within-subjects tests of three Study 3 scenarios from Meyer et al.: Direct-to-Consumer Genetic Testing, Autonomous Vehicles, and Retirement Savings. We preregistered and conducted materials testing for all three scenarios on October 14, but to date, we have only conducted a full experiment with the DTC Genetic Testing scenario, reported in the preprint. We will conduct the full experiments with Autonomous Vehicles and Retirement Savings soon.

[2] In these within-subjects experiments, we also asked participants to rank the agent’s three options (A, B, and A/B test), and in each of the three scenarios we tested, we find large proportions of participants who rate the A/B test as the decision-maker’s worst option. We observe this even though an A/B test necessarily assigns only 50% of people to the rater’s least-preferred policy, which ought to be superior to the option of assigning 100% of people to the rater’s least-preferred policy if people’s dislike of experiments only ever reflects their dislike of the worst policy it contains.

[3] We take the main thrust of MDS's objection to Checklist to be the alleged "confound," in which A/B raters, but not policy raters, are confronted with "causing deaths for a trivial benefit (e.g., saving cost)." For the sake of comprehensiveness, we note that in the real world, especially in the non-profit world of hospitals and health systems that operate on the slimmest of margins, resources are often extremely constrained. Reprinting and redistributing all provider badges—or printing and hanging posters—is not a "trivial" decision, especially if scaled across a large system. Moreover, even assuming a magical world in which money is no object, more interventions are not always superior to fewer interventions. Finally, two or more treatments are often mutually exclusive (as is the case with blood pressure Drug A and Drug B in our Best Drug scenarios).

Why Colleges And Universities Should Not Disinvite Speakers

2016-09-08T19:22:00.000-07:00

Since 2000, more than 140 people who have been invited to speak on American college and university campuses have been “disinvited” before they could give their talks, usually after objections from students. It is easy to find news accounts of these events, or non-events: this handy database details (as of this writing) 342 successful and unsuccessful disinvitation campaigns. One of the most prominent recent examples may be the New York University administration's decision in September 2016 to cancel a lecture by the Nobel prizewinning biologist James Watson, on strategies for curing cancer, six days before it was scheduled to happen, because of student complaints about statements Watson had made on other topics in the past. Later in the same academic year, Watson was also disinvited from giving a talk on the same topic at the University of Illinois.

Notably, these events were not reported in any mainstream media outlets, which mostly seem to regard suppressing lectures as normal business in academia today. Indeed, campus groups are less likely to invite controversial speakers in the first place, given how likely it is that such invitations will meet with opposition and possible cancellation. It is much harder to find stories about academic leaders who clearly rejected demands for disinvitation and clearly explained why. If I were a college president, and a campaign to disinvite a speaker arose on my campus, here is the letter I would write.

To The Campus Community:

Recently, an organization on our campus announced that a particular person has been invited to speak here. Many students, some faculty, and a few alumni of our institution have publicly objected to the invitation of this speaker. Some have demanded that his invitation be rescinded, so that he will not be able to use our “platform,” and the imprimatur of our college will not be attached to the controversial things he is expected to say, or to anything he has said or written in the past. It has been alleged that the speaker will make some students feel uncomfortable or unsafe, that his beliefs are repugnant, and that his ideas are not rational or grounded in solid evidence.

I have heard these demands, and listened to the arguments of their supporters. I am writing to say that I do not agree with them, to announce that the speaker’s talk will go forward as planned, and to explain why.

First, let’s be clear that this is not a matter of “freedom of speech” in the legal sense. The First Amendment to the U.S. Constitution, as interpreted by the courts, says that governments may not abridge freedom of speech, but it puts no restrictions on private institutions like our college. Disinviting someone is unprofessional and rude, but we have the legal right to disinvite—or to not invite in the first place—anyone we please.

Some will say that if I do not disinvite this speaker, I am therefore supporting him. This is not the case. There is a clear logical distinction between endorsing a person’s claims and beliefs, and giving him an opportunity to express those claims and beliefs. There are many speakers who have come to our campus with whom I disagree, but I did not block them. We do not ban books from our library or websites from our network, so we do not ban speakers from our grounds. Constitutional law does allow for some non-commercial speech, such as explicit incitements to violence or disorder, to be punished by the state. And I would certainly agree to interrupt or block a speech that was directly threatening anyone’s immediate physical safety. But beyond that, it is not my place to decide who is and is not permitted to speak here.

In fact, no one’s personal preferences should have anything to do with this question. To see why, we must consider what our college, or any college, is here for. An institution of higher education is organized around the concept of learning. Learning is why we are all part of this community. Students are here to learn about the arts, sciences, and other disciplines they pursue. Professors are here to learn as well: to learn entirely new things about the natural, social, and humanistic worlds, and to learn how to teach more effectively. The staff and administration are here to make these endeavors possible.

Learning doesn’t just mean going to class, doing homework, and taking exams. If a group of students or faculty members are so interested in hearing, debating, and engaging with the ideas of a person from outside our community that they decide to invite him here and organize and attend an event, I cannot rebuke them. In fact, I congratulate them, for they are engaged in an act of learning that goes beyond what is strictly required of them. They are spending their personal time and energy furthering the central purpose of our institution. Even though I may not like all the speakers they are selecting, I still love the fact that they are bothering to select speakers at all.

The reputation of our college and the value of the degrees we confer will not be affected by the speakers we host, but they will suffer if we acquire a reputation for stifling unpopular views. A college does not need to “manage” its “brand,” and it should not act like a for-profit company in this respect. All colleges stand for excellence in scholarship; that is the only brand that matters, and disinviting speakers and suppressing thoughts will only cheapen it.

Even speakers who espouse ideas you find dangerous and are sure you would never accept—like a creationist, a 9/11 “truther,” a genocide denier, or someone who argues that “rape culture” is a liberal myth—may be worth hearing. If you find out what they claim for evidence, and the kinds of words, phrases, and arguments they use, you can better rebut them yourself—whether you are reasoning with other people, or questioning your own beliefs. Listen to the other side's case in order to strengthen your own. In other words, know thy enemy.

To those who say the speaker may make them feel unsafe, I must point out that higher education is not designed to make people safe. Instead, it is our society’s designated “safe space” for disruptive intellectual activity. It’s a space that has been created and set apart specifically for the incubation of knowledge, by both students and faculty. Ideas that may seem dangerous or repugnant can be expressed here—even if nowhere else—so that they can be analyzed, discussed, and understood as dispassionately as possible. Many of humanity’s greatest achievements originated as ideas that were suppressed from the public sphere. Some, like the theory of evolution by natural selection, equal rights for women and minorities, trade unions, democracy, and ironically even the right to free speech and expression, are still seen as dangerous decades and centuries later.

If you are against this speaker coming here, please also consider this: Some members of our community—some of your friends and colleagues—do want him to visit. By asking me to disinvite him, you are implicitly claiming that your concerns and preferences are more important than those of the people who invited him. Are you really sure that you are so right and they are so wrong? Psychologists have found that people tend to be overconfident in their beliefs, and poor at taking the perspective of others. That might be the case here.

A decision by me to bar this speaker would have far-reaching negative repercussions. It would make everyone in our community think twice before they stage a provocative event or invite a controversial speaker. Canceling this invitation would not only prevent this person from talking; it would reduce the expression of views like his in the future, and probably chill speech by anyone who could be regarded as controversial. And it would set a precedent that future leaders in higher education may point to if they feel pressured to do the same. All of this would be antithetical to our common purpose (and our institution's social function) of learning and discovery.

Note that it’s especially important for us to be open to viewpoints not already well-represented among our faculty. The professors here are a diverse group, but many studies have shown that professors tend to be more politically left-wing than the population at large. Even the most conscientious instructor may inadvertently slant his teaching and assignments towards his own political viewpoint. Of course, this applies more in the social sciences and humanities than in math or physics, but it does happen. Giving campus organizations wide latitude to invite the speakers they wish helps to increase the range of thoughts that are aired and discussed here.

If you feel that this speaker’s talk might upset you, I offer this advice: Go. Yes, go to the talk, listen to it, record it—if the speaker and hosts give permission—and think about it. Expose yourself to ideas that trouble you, because avoiding sources of anxiety is not the best way to cope with them. When you encounter troubling ideas on our campus, try to desensitize yourself to emotional reactions by keeping in mind that ideas themselves cannot hurt you.

And please do not try to distract, interrupt, or shout down the speaker. "Deplatforming" a speaker risks making him into a "free speech martyr" who will attract more followers because he is seen as a teller of truths so dangerous that his opponents try to ban them. Just ignore him, rather than spark an explosion that might win him new fans or deepen the ardor of the ones he already has. It is a natural impulse for us to suppress speech that we don’t like, just as it is natural for us to retaliate against or outlaw behavior we don’t like. That’s why we have laws to protect unpopular speech and institutions to foster and study it. You can take this golden opportunity to train yourself to respond to speech that upsets you by listening to it, analyzing it, looking up its sources, developing reasoned counterarguments to it, and considering why people agree with it and whether it might not be as contemptible as you have been told. These are the intellectual skills that all members of our community are committed to building.

In fact, if you’re already committed to everything this speaker is against, then you should definitely listen to him. John Stuart Mill wrote, “He who knows only his own side of the case, knows little of that.” When you never encounter people who vigorously argue for positions you don’t agree with, you may come to believe that those arguments don’t have merit, don’t deserve to be heard, or don’t even exist. The argument you imagine your opponents making is probably weaker and easier to dismiss than the argument they would actually make if they had the chance.

Of course, you don’t have to listen to speakers you disagree with. That’s the beauty of our system: We are all committed to the broad goal of learning, but we are never forced to attend to people we can’t stand. If you want to protest this speaker, do so peacefully, outside the venue, and do not block anyone from attending. Hand out fliers or arrange for other speakers to present counterarguments or different ideas. As Justice Louis Brandeis said, “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the process of education, the remedy to be applied is more speech, not enforced silence.” And if just being in the speaker’s presence will cause too much discomfort, you may avail yourself of the truly safe space of your dorm room or apartment, or the company of other like-minded students.

Please be careful, though, about making a habit of avoiding or trying to suppress uncomfortable ideas. In the wider world there are no spaces where you can be safe from the thoughts in other people’s heads, so if people are stereotyping you or otherwise judging you unfairly, nothing that restricts speech here on our campus will solve that problem. Holding negative thoughts and uttering negative speech are a part of human nature that our college does not exist to protect you from. On the contrary, we exist to arm you with the intellectual tools to understand, analyze, and dispute incorrect ideas. Shutting off those ideas does nothing to inoculate you against them, and may ironically make you even more vulnerable in the future.

The same is true of our college—of our community—as a whole. Once an organization stops challenging itself with ideas, old or new, it becomes intellectually flaccid and surrenders any claim to scholarly excellence. Valuing comfort and community over openness to ideas is perfectly fine for many organizations. Religions, charitable causes, and political parties are free and sometimes even wise to exclude ideas and people that they disagree with. But the essential common value in a university is free inquiry for the purpose of learning, and by joining the university we have all sacrificed our right to be safe from ideas we disagree with. Community is important here, but openness is fundamental. It’s for that simple reason that I will not disinvite any speaker who has been legitimately invited to talk to us.

Ruth Simmons, the former president of Brown University, told the graduating Smith College class of 2014, “The collision of views and ideologies is in the DNA of the academic enterprise. We don't need any collision-avoidance technology here.” I could not agree more. Therefore, as the leader of this community of scholars, of our academic enterprise, I would be doing the opposite of my duty were I to force silence on this or any other speaker. I hereby decline the requests to disinvite him. And I encourage all campus groups and organizations to invite the speakers they want to hear, knowing that I will respect and support your efforts to learn and engage with their ideas.

Sincerely,

Your College President

NOTE: This was originally published on 8 September 2016 and was revised and updated on 4 June 2017 and 7 April 2018. If you represent a mainstream online or print publication and would like to publish a version of this essay, please contact me.

Confusion About Correlation and Causation ... in a Research Methods Textbook?!

2016-01-24T15:57:00.002-08:00

Every so often, textbook publishers send me free copies of their books. Usually these are books for courses I teach, but sometimes they aren't. This week I received a copy of Discovering the Scientist Within: Research Methods in Psychology from Worth Publishers, a new title in its first edition. I don't teach a research methods course, but I flipped through the table of contents anyhow, and I noticed an entry called "Research Spotlight: The Upside to Video-Game Play" on page 31. Since the question of how playing video games might affect cognition and behavior is a controversial one, I was curious to see what the authors had to say about it in a research methods context. Unfortunately, their discussion has some problems.

First, they claim that "there are some real advantages to playing video games," citing a finding that "more time spent playing video games coincided with greater visual-spatial skills." Stating that playing video games has advantages is a statement of causality. If people who played video games just happened to have greater visual-spatial skills (maybe because their visual-spatial skills were greater than those of non-gamers to start, or because their visual-spatial skills were improving faster than those of non-gamers), there would be no "advantage" to the game-playing. The abstract of the underlying paper by Jackson et al. (2011) makes no mention of random assignment of participants to different amounts of video game play, so there's no justification for inferring causality. (Additionally, it notes that the video-game players had lower GPAs.)

Second, they say that "it just so happens that surgeons benefit from video-game playing as well," citing a study by Rosser et al. (2007) that found that "surgeons who played video games for more than 3 hours a week made 37% fewer errors and were 27% faster in laparoscopic surgery and suturing drills compared to surgeons who never played video games." This is followed by speculation as to the mechanism by which game playing could cause these differences. However, the evidence of causation here is even weaker than for the study of children cited above. It's not even a longitudinal study—it's just a cross-sectional finding of an association between video game play and performance on (computerized) tests of surgical skill. Again, one need only read as far as the Rosser et al. abstract to find the statement "Video game skill correlates with laparoscopic surgical skills." There is no evidence of causality, but the textbook authors have said that surgeons "benefit" from playing video games.

Finally, the caption below the stock photo reinforces the thrust of the boxed text by asking "If video games can make you a better surgeon, what other areas of your life could playing video games improve?" As they say in courtroom dramas, "Objection! Assumes facts not in evidence."

Sure, this is a run-of-the-mill mistake that laypeople make all the time: mistaking evidence of correlation (video game playing co-occurring with increased spatial skill or surgical proficiency) for evidence of causation (playing video games making your spatial skills and surgical proficiency better than they were before). But this is a textbook on research methods in psychology. If the authors of such books have the proverbial "one job to do," it is teaching their readers what conclusions can be drawn from what kinds of evidence. That's what education in research methods is all about: learning to design research studies that have the power to permit certain inferences, and learning which inferences can and cannot logically follow from which designs. You can think of analogies to other fields: a nutrition book reversing the properties of carbohydrates and fat? An algebra textbook getting the quadratic formula wrong? A history book that confuses the Declaration of Independence with the Constitution? Correlation versus causation is not a nuance or side issue; it's at the heart of the behavioral science enterprise.
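To see how easily a correlation can arise with no causation behind it, here is a minimal simulation sketch in Python. Everything in it is hypothetical: the variable names, the sample size, and the assumption that a single pre-existing "aptitude" drives both how much people choose to play and how well they score. It does not use or reproduce data from Jackson et al. or Rosser et al. By construction, playing has zero effect on the test score, yet the self-selected players still show a sizable correlation, while randomly assigned play does not.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n=5000, seed=0):
    """Return (observational r, randomized r) for a made-up population."""
    rng = random.Random(seed)
    # Hypothetical pre-existing aptitude, present before any game is played.
    aptitude = [rng.gauss(0, 1) for _ in range(n)]
    # Observational world: people with more aptitude choose to play more.
    chosen_hours = [a + rng.gauss(0, 1) for a in aptitude]
    # Experimental world: hours are assigned at random, unrelated to aptitude.
    assigned_hours = [rng.gauss(0, 1) for _ in range(n)]
    # The test score depends on aptitude only; playing has no effect at all.
    score = [a + rng.gauss(0, 1) for a in aptitude]
    return pearson(chosen_hours, score), pearson(assigned_hours, score)


if __name__ == "__main__":
    observational_r, randomized_r = simulate()
    print(f"self-selected players: r = {observational_r:.2f}")
    print(f"randomly assigned:     r = {randomized_r:.2f}")
```

Only the randomized comparison speaks to what playing does, which is why the absence of random assignment in a study matters so much for the conclusions it can support.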

The authors of Discovering the Scientist Within must understand the distinction between correlation and causation, and I am sure they can generate the plausible alternative (non-causal) explanations for these video game results that I mention above. I know this because on page 30, in the paragraph immediately before the "Research Spotlight" box, they write, "often there is not a set direction of how one thing influences another ... News coverage, such as in cases of school shootings, often portrays playing video games as the cause of aggressive behavior. Yet it is equally likely that aggressive individuals gravitate toward violent video games" [emphasis added].

The fact that mistakes like this can turn up in a book meant to educate its readers to avoid them is remarkable, and I think it goes to show just how confounding sound causal inference can be for the human mind. As Daniel Simons and I argued in The Invisible Gorilla, human beings are susceptible to an "illusion of cause" that leads us to jump to particular causal conclusions in all kinds of situations where the evidence we have doesn't logically justify them—indeed, where other explanations are equally or even more likely, and where the assumption of causality can get us into big trouble. The ease with which we can generate mechanisms to explain a particular causal inference can contribute to the illusion. For example, being aware of the "neural plasticity" concept could make it seem more likely that intensive cognitive work (e.g., video-gaming) might "train" some more fundamental underlying cognitive capacity (e.g., spatial skill) or transfer to some other practical task (e.g., surgical proficiency). None of us, not even psychology professors who write textbooks on research methods, are immune to these fundamental thinking pitfalls.

Hopefully the second edition of Discovering the Scientist Within will correct these correlation/causation errors, as well as any other issues that may lurk in the text. My quick flip-through picked up one more passage the authors might want to think about rewriting:

... the author Malcolm Gladwell, a self-described "cover band for psychology," is known for his ability to summarize and synthesize psychological findings so that the general public can benefit from the exciting advances in knowledge that psychological researchers have made.

Accompanying this sentence on page 41 is a photo of Gladwell's book Outliers. As many readers of this blog will know, I don't agree that the general public is benefitting from Malcolm Gladwell's writing, precisely because he doesn't summarize and synthesize as well as people think he does. Since correlation, causation, and statistical thinking are among the things Gladwell has difficulty with, it doesn't seem like a research methods textbook should be endorsing his work.

Why Phones Need a Driving Mode: Questions and Answers

2015-11-02T10:58:00.002-08:00

Last Friday, The Wall Street Journal published "A Simple Solution for Distracted Driving," an essay that I wrote with Daniel Simons. We argued that all smartphones should come equipped with a Driving Mode—an easily activated or default setting that would prevent users from engaging in the most distracting activities while they were driving their cars. In such a short piece, we were only able to sketch the outlines of this idea. Below are some further elaborations, based on comments we saw or received, organized as a set of questions and answers. (If you want to hear me talk about Driving Mode for a few minutes, you can listen to my radio interview this morning on The Financial Exchange.)

My phone already has Driving Mode.
As we noted, some phones already have features like the one we are proposing, and some of them are similar to our version. However, the vast majority of phones do not have the "robust" driving mode that we advocate. By robust we mean a driving mode that: (1) eliminates all sources of significant distraction, including non-emergency communications; (2) permits full use of GPS and navigation; and (3) sends automatic responses to anyone who tries to contact you while you're driving, and holds the incoming messages without showing them to you until you exit driving mode. Here's an interesting blog post along similar lines to our idea, with more implementation details (we weren't aware of this post when we wrote our essay).

Isn't what you propose the same as Apple CarPlay?
No. From what we can tell about CarPlay, it's an iOS 9 feature that integrates your phone with a screen built into your car. And it permits the user to do a lot more than our driving mode would. CarPlay lets you make and take calls, send and receive messages, etc. Sure, it lets you do this hands-free, but those activities are still very cognitively absorbing even if you don't need your hands to do them.

What about AT&T's "DriveMode" app? (Or similar apps.)
Apps that implement something resembling our robust driving mode are great, and people should use them. AT&T's DriveMode app has several nice features, but lacks some of the ones we think are important. What we'd like to see is a universal driving mode that is fully integrated with the operating system and the hardware of the phone, so that it can have full control over all communication, app use, etc. Third-party apps may not have sufficient control to really implement a robust driving mode, or to stay functional when the operating system or hardware changes.

I just put my phone in my purse/briefcase or turn it off, so I don't need Driving Mode.
If you can maintain the discipline to do this, good for you! But this prevents you from using your phone for GPS and navigation, features that probably provide more value than they cost in potential distraction. For users who need those features, or who forget or can't force themselves to put their phones far away while driving, a Driving Mode would help. And some of those less disciplined users are going to be behind the wheel of their own cars while you are on the road.

There are already public service campaigns, laws, and other efforts against distracted driving.
Very true. In an earlier draft of our essay, we mentioned some of these. We aren't against them (except when they are based on erroneous assumptions, such as the idea that hands-free technology will solve the problems of limited attention); we just think that it's so tempting to use a phone while driving, and also so dangerous, that industry can do more to help customers exert self-control.

I like my car's head-up display feature. It feels as though I can read the information on the windshield while keeping my eyes on the road and not driving any worse.
Unfortunately, there is a lot of research on attention in general and head-up displays in particular, and it all generally concludes that this feeling is an illusion. Like the feeling we have that we can talk on the phone (or do even more) while driving, or the feeling people have that they can drive just fine when they are drunk, it reflects a mismatch between the signals our brain uses to monitor its own performance levels, and the reality of those performance levels. Often they line up, but when it comes to knowing how well we are paying attention, we can be way off.

Won't a driving mode that activates based on GPS-detected speed stop passengers from using their phones? And what about people using mass transportation?
We were well aware of these issues, but in a short essay we didn't have space to address them. And we aren't sure ourselves of the optimal solutions. But we are cognitive psychologists, not mobile operating system designers. We are sure that the geniuses at Apple and Google can come up with clever answers, perhaps working with car makers and phone service providers. Meanwhile, here are a couple of our thoughts:

The key principle is that a phone should have an intelligent default behavior, without preventing users from doing things outside the default. If a phone enters driving mode automatically over 10 mph, perhaps it would require just two taps to exit the mode. Passengers or mass-transit riders could do this quickly and easily when a prompt popped up on their screen, but drivers would be less likely to. Some would, of course, but many wouldn't. Many users stick with default settings and never learn how to change them, or learn how but don't bother. The default is not just an arbitrary factory setting: it is also interpreted as a recommendation or a social norm.
Perhaps this feature could be deactivated by a driver before he starts driving, so that driving mode wouldn't start. This simply shifts the burden of action from those who want to be in driving mode to those who don't. Again, there is no God-given standard setting for what features should be available on a phone at what times: making people affirmatively decide to enable distractions seems just as sensible, if not more so, than making them affirmatively decide to disable them.
Even if there is no automatic activation for Driving Mode, its mere existence, combined with the ease of initiating it, should help. Even if only 10% of drivers turned it on, that would be a win.
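For readers who think in code, the default-on behavior described above can be sketched roughly as follows, in Python. This is only an illustration, not a design for any real phone: the `Phone` class, its method names, and the auto-reply wording are all hypothetical, and the threshold and tap count are simply the example numbers from the text. The sketch also leaves out things a real system would need, such as emergency communications and deciding when a trip has ended.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SPEED_THRESHOLD_MPH = 10.0  # the "over 10 mph" example from the text
TAPS_TO_EXIT = 2            # the "just two taps" example from the text
AUTO_REPLY = "I'm driving and will see your message when I stop."  # hypothetical wording


@dataclass
class Phone:
    """Hypothetical phone with a default-on driving mode."""
    driving_mode: bool = False
    opted_out: bool = False  # True once the user has tapped out of the mode
    exit_taps: int = 0
    held: List[Tuple[str, str]] = field(default_factory=list)
    auto_replies: List[Tuple[str, str]] = field(default_factory=list)

    def update_speed(self, mph: float) -> None:
        """Default behavior: turn driving mode on above the speed threshold."""
        if mph > SPEED_THRESHOLD_MPH and not self.opted_out:
            self.driving_mode = True

    def tap_exit(self) -> List[Tuple[str, str]]:
        """Register one tap on the exit prompt; the final tap releases held messages."""
        if not self.driving_mode:
            return []
        self.exit_taps += 1
        if self.exit_taps < TAPS_TO_EXIT:
            return []
        self.driving_mode = False
        self.opted_out = True
        self.exit_taps = 0
        released, self.held = self.held, []
        return released

    def receive(self, sender: str, text: str) -> Optional[str]:
        """Return the text if it is shown immediately, or None if it is held."""
        if not self.driving_mode:
            return text
        self.held.append((sender, text))
        self.auto_replies.append((sender, AUTO_REPLY))
        return None

    def allowed(self, feature: str) -> bool:
        """Navigation always works; other features are blocked in driving mode."""
        return feature == "navigation" or not self.driving_mode


if __name__ == "__main__":
    phone = Phone()
    phone.update_speed(35.0)                   # car is moving, so the default kicks in
    print(phone.receive("Alex", "Call me?"))   # held, so nothing is shown
    print(phone.allowed("navigation"))         # navigation stays available
    phone.tap_exit()
    print(phone.tap_exit())                    # second tap exits and releases the message
```

The point of the sketch is that the burden of action sits with the person who wants out of driving mode: a passenger can leave it in two taps, while a driver who does nothing stays protected.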

Driving Mode is a further step in the infantilization of people by governments and elites. Adults should be aware of what they are and aren't capable of doing, and should be free to choose how to use their phones.

Speaking for myself only, I sympathize greatly with this point of view. I wish people were more aware of their mental capabilities, and I try to educate people about that. I worry quite a bit about the impulse to regulate or forbid behavior that people don't like. Regulations are often put in place on speculation but rarely repealed when their costs turn out to outweigh their benefits. But it seems much more infantilizing to regulate, say, what words people can use, or what Halloween costumes they can wear, or what subjects they can research or study, than to regulate how distracted they can make themselves while driving a car at high speeds. We already regulate many aspects of driving for safety reasons: we restrict speeds, require turn signals, encourage seatbelt use, paint lanes on roads, put up stoplights, and so on. Even if you have perfect self-control and don't need a driving mode, you might agree that other people on the road could benefit from being less distracted, and thus you would benefit too.
One way to look at the situation is this: The invention of the internet and the smartphone has brought countless benefits to everyone. Society is much, much better off with them than without. Compared to the gains we have made in staying connected, having knowledge at our fingertips, and even just being better entertained, the loss from a slightly restrictive driving mode is a very small price to pay. No higher a price, I would think, than what you lose by not being able to drive 70 miles per hour on empty local roads at night, which is a restriction everyone accepts as sensible. While I admire the behavioral technology of the "nudge," which has the power to make people better off without reducing their real options, I also worry that it can be used in inappropriate ways, to push people toward choices that are not in their own true best interests. This, however, is not one of those cases. Eliminating distractions while driving should be in everyone's interests. And finally, note that we are not proposing any new laws or regulations; indeed, regulating the features of smartphones sounds (to me) like a futile exercise. We are only urging the phone industry to think about how to make their products safer and better than they already are.

No, College Lectures Are Not "Unfair"

2015-09-13T20:44:00.001-07:00

In her recent New York Times essay "Are College Lectures Unfair?" Annie Murphy Paul, a science writer, asks "Does the college lecture discriminate? Is it biased against undergraduates who are not white, male, and affluent?" She spends the rest of the essay arguing the affirmative, claiming that "a growing body of evidence suggests that the lecture is not generic or neutral, but a specific cultural form that favors some people while discriminating against others, including women, minorities, and low-income and first-generation college students." She cites various studies that find that "active learning" and "flipped classroom" pedagogical techniques lead to the biggest improvements in performance among the groups who do the worst in traditional lecture-format classes.

Unfortunately, while it may well be true that flipping the classroom (having students watch video lectures at home and using classroom time for problem-solving and discussion) and making learning more active (increasing low-stakes quizzes, which enhance memory for concepts) are excellent ideas—and I personally think they are—the argument that the use of lectures has anything to do with fairness and discrimination is simply erroneous.

Here's why. First, if one group of students tends to perform better than another under a particular instructional regime, then it is likely that any change that improves everyone's performance (as active learning does, according to studies cited by Ms. Paul) should benefit most the group that starts out doing worst. This is a simple function of the fact that grades have maximum values, so there is less room for improvement for students who are already doing well. If this point isn't clear, imagine a hypothetical educational intervention that leads to every student knowing the answer to five particular questions on a 100-question final exam. The best-prepared students would probably have known several of those answers absent the intervention, while the least-prepared students would have had a good chance of not knowing them. Therefore, the least-prepared students might gain as much as 4 or 5% on their final exam grades, but the best-prepared students might gain as little as 1 or 0%. This would narrow the achievement gap between those groups. Any intervention that results in more learning is likely to have a similar pattern of effects.
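For readers who like to see the arithmetic, here is a minimal Python sketch of that hypothetical exam. It rests on one simplifying assumption that is mine, not Ms. Paul's or the researchers': a student's chance of already knowing any given question equals the fraction of the exam they would have gotten right anyway. The function names and the baseline scores of 90 and 20 are made up for illustration.

```python
def expected_gain(baseline_score, n_taught=5, n_questions=100):
    """Expected extra points when an intervention guarantees n_taught answers.

    Assumption (illustrative only): the chance that a student already knew
    any one of the taught questions equals their baseline fraction correct.
    """
    p_known = baseline_score / n_questions
    return n_taught * (1 - p_known)


def gap_after(strong, weak, **kwargs):
    """Expected score gap between two students after the intervention."""
    strong_after = strong + expected_gain(strong, **kwargs)
    weak_after = weak + expected_gain(weak, **kwargs)
    return strong_after - weak_after


# Hypothetical baselines: a well-prepared and a poorly prepared student.
for label, score in [("best-prepared", 90), ("least-prepared", 20)]:
    gain = expected_gain(score)
    print(f"{label}: baseline {score}, expected gain {gain:.1f} points")

print(f"gap before: {90 - 20}, gap after: {gap_after(90, 20):.1f}")
```

Under that assumption the student who starts at 90 gains about half a point and the student who starts at 20 gains about four, so the gap shrinks even though neither student's group membership appears anywhere in the calculation. Setting `n_taught` to 0 leaves the gap untouched, and setting it to 100 closes it completely.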

So active learning should be good for any students who start out doing worse in general, not just for minority, low-income, or first-generation students. As Ms. Paul notes, "poor and minority students are disproportionately likely to have attended low-performing schools" and get to college "with less background knowledge." Suppose you tried active learning in a school where the students were overwhelmingly the children of white, middle-class, college-educated parents. You would still expect the lowest-performers among those students to benefit the most from the improved methods. Likewise, the best-performing minority students should get less out of improved pedagogy than the worst-performing minority students. In other words, the value of active learning for traditionally underperforming groups of students has everything to do with the fact that those students have underperformed, and nothing to do with the fact that they come from minority groups.

Now to the question of whether older pedagogical approaches "discriminate" against those minority groups, as Ms. Paul says they do. According to the American Heritage Dictionary of the English Language, to discriminate is "to make distinctions on the basis of class or category without regard to individual merit, especially to show prejudice on the basis of ethnicity, gender, or a similar social factor." Teaching a course in an old-fashioned lecture format, though it may often be less effective than teaching in flipped or active formats, and for that reason may result in lower grades for some types of students than for others, makes no distinctions between classes or categories of people, and therefore cannot be a form of discrimination. Indeed, any inferior form of instruction should lead to wider differences between students who start at different levels.

Consider the limiting case: imagine a "form of instruction" that is no instruction at all: the professor never issues a syllabus or even shows up, but still gives the final exam at the end of the semester. Students who started out with the most "background knowledge" about the topic will still know the most, and students who started out knowing the least will still know the least. All pre-existing differences between ethnic groups and any other student groups will remain unchanged. On the other end of the instructional spectrum, suppose the professor's teaching is so perfect that every student learns every bit of the material: then there will be no differences between any groups, because all students will receive grades of 100%. Note that none of this concerns what groups the students are part of—it can all be explained entirely by the artifact of high-quality instruction benefiting poorer-performing students more than better-performing students. Therefore, Ms. Paul's essay uses the words "biased" and "discriminating" incorrectly, with the pernicious effect of accusing anyone who doesn't flip their classroom or give lots of quizzes of being prejudiced against minority students.

For what it's worth, I have been truly impressed by the growing body of research on the science of learning. I think it's one of the most exciting practical achievements of cognitive psychology, and I am trying to incorporate more of it into my own teaching. But I also believe that good, engaging lectures have their place, and may be more effective in some disciplines than others. To be clear, I have no objections to any of the research Ms. Paul cites in her essay. (Indeed, I have not even read most of that research, because what's at issue here is not the scientific results, but the meaning Ms. Paul ascribes to them. I've assumed that she described all the results accurately in her essay. If the researchers somehow managed to separate the effects of minority status and baseline knowledge or performance in their analysis of the active learning effects, Ms. Paul doesn't say anything about how they did it—and since that analysis would be the logical lynchpin for her claims of bias, discrimination, and unfairness, it is negligent to ignore it.) We have learned a lot about effective teaching methods, but none of it justifies the sloppy, inflammatory claim that lectures are "biased" and "discriminate" against students from minority, low-income, or nontraditional backgrounds.

Martin Thoresen's World Chess Championship

2015-01-10T15:06:00.001-08:00

My third “Game On” column, "The Real Kings of Chess Are Computers," appears this weekend in The Wall Street Journal. I write about the "real world chess championship," which is known formally as the Thoresen Chess Engines Competition, or TCEC. This is a semi-annual tournament that pits almost all the top computer chess programs against one another. Since the best chess engines are now much stronger than even the best human players, a battle between the top two engines is a de facto world championship of chess-playing entities.

That battle was the Superfinal match of TCEC season 7, and it was won last month by Komodo over Stockfish (both playing on the same 16-core computer). In a digital-only extra, "Anatomy of a Computer Chess Game," I try to explain a key moment in game 14 of the match, which gave Komodo a lead it never relinquished over the remaining 50 games.

As part of the research for these pieces, I interviewed TCEC impresario and eponym Martin Thoresen by email. Below is an edited transcript of our conversation, which took place between 29 December 2014 and 2 January 2015. The questions have been re-ordered to make the flow more logical.

CHRISTOPHER CHABRIS: Let’s start with the recent Season 7 Superfinal match. What is your opinion about the result? Do you think it shows that Komodo is a “better chess player” than Stockfish, in their current versions?

MARTIN THORESEN: I think the Superfinal was very close and exciting. The draw rate was slightly higher than what I expected, but then again the engines are very close in strength so this is quite natural. I think the result shows that Komodo is the better engine on the kind of hardware that TCEC uses. And for grandmasters with powerful computers this should be something to take note of when they analyze games using chess engines.

Do you believe that TCEC features the “best chess players” in the world?

Yes, I would say any of the top programs of say, Stage 3 and onwards would pretty much crush any human player on the planet using TCEC hardware.

Do you think it is a problem to have so many draws (53 out of 64 games)? It definitely distinguishes engine-engine matches from human-human matches to have so many draws, but I agree with you that it must result partly from the players being stronger than the best humans.

Personally I don’t mind the draw rate being this high in the Superfinal, it makes it very tense. But one of the main goals of TCEC is to entertain people. Too many draws detract from that and too many one-sided openings would lower the quality overall, even if it lowers the draw rate. I would be satisfied with a draw rate of roughly 75% in the Superfinal.

You must have watched more engine-engine games than almost anyone else. Were there any games or particular moves or positions that you thought were especially beautiful or revealing in this most recent Superfinal match?

I have not looked deeply at all the games yet, but games like #9 strike me as fascinating.

Let’s talk about some of the details of how TCEC works. Are the games played entirely on your personal computer at your home?

Yes, it’s a 16-core server I’ve built myself. It has two 8-core Intel Xeon processors and 64 GB RAM. It’s located at home here in Huddinge, a suburb of Stockholm, Sweden. I live in an apartment of about 45 square meters.

Why do the games run only one at a time? Because it all happens on one computer? Have you considered using multiple computers so that more games can happen at one time?

Yes exactly, they run only one at a time because the engines utilize all 16 cores to get maximum power, which makes it impossible to run more games. Using more computers is of course something I wish I could do, but then people need to donate more. ☺ The server cost me roughly €4000–€5000 to build. Of course it would be possible to limit each engine to say, four cores, then I could have four games running simultaneously, but then again the engines would be weaker due to the fewer cores. I want TCEC to show only the highest quality of games. Not to mention that I’d have to redesign the website to support many games at once.

How hard was it to write the code that “plays” the two engines against each other, passing moves back and forth, and so on? Do the engines provide you with an API, or do the engine authors give you a special version that corresponds to an API for your own server code? (I assume you wrote the server code yourself too, correct?)

The interface that plays the games is a small command line tool called cutechess-cli, but somewhat modified for TCEC by Jeremy Bernstein following my instructions. I have not coded this tool. Cutechess is simply a UCI/Xboard interface tool that “runs” the engines in accordance with the UCI or Xboard specifications. Basically all chess engines comply with the UCI or Xboard protocols for I/O requests (time control, time left, the move it makes, etc.). Using this tool does not give you a chessboard to view the action like a GUI (Fritz, Arena, SCID, etc.) so ironically I can’t actually watch the game on the server—all I see is a bunch of text.
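[A note from me, not from Martin: for the curious, here is a rough Python sketch of the kind of text a UCI interface passes back and forth. It is not cutechess-cli's code and not TCEC's setup. The `position`, `go`, and `bestmove` lines are standard UCI commands, but the helper names, the clock values, and the scripted engine replies are all made up for illustration, and no real engine is launched.]

```python
def position_command(moves):
    """Build the UCI 'position' line for a game played from the start position."""
    if not moves:
        return "position startpos"
    return "position startpos moves " + " ".join(moves)


def go_command(wtime_ms, btime_ms, winc_ms=0, binc_ms=0):
    """Build the UCI 'go' line that tells the engine how much clock time is left."""
    return f"go wtime {wtime_ms} btime {btime_ms} winc {winc_ms} binc {binc_ms}"


def parse_bestmove(line):
    """Pull the chosen move out of a reply such as 'bestmove e2e4 ponder e7e5'."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "bestmove":
        raise ValueError(f"not a bestmove line: {line!r}")
    return parts[1]


# Scripted replies stand in for two real engines (hypothetical data).
scripted_replies = ["bestmove e2e4 ponder e7e5", "bestmove e7e5"]

moves = []
for reply in scripted_replies:
    # What the interface would send to the engine whose turn it is.
    print(position_command(moves))
    print(go_command(wtime_ms=60000, btime_ms=60000))
    # What it would read back, and the move it would relay to the opponent.
    moves.append(parse_bestmove(reply))

print("game so far:", " ".join(moves))
```

[A real interface also handles the opening handshake, the clocks, illegal moves, and engine crashes, all of which this sketch leaves out.]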

Who developed the software to broadcast the games to the internet? As someone who followed the latest Superfinal and browsed the archives quite a bit, I can say that it has a very nice interface.

There are two parts of TCEC. One is the website which shows the games, the other is the server on which the games are played. These two are not run on the same machine (for obvious performance reasons), so the server uploads the PGN to the website each minute. The website is designed by me and it has had different designs in previous seasons. The core technology on which it is built is the free JavaScript chess viewer called pgn4web.

How much money would you estimate you have personally spent, and how much total has been spent, to run the TCEC since it started, and season 7 specifically?

I have spent a lot of money. I am not quite sure how much, but I would estimate €6000–€7000 since TCEC started (hardware upgrades, power bills, etc.).

How many hours do you spend on it out of your own life?

For Season 7 I didn’t really code anything new for the website compared to Season 6, so I didn’t spend much time preparing this time around. But when I made the new (current) website for Season 6, I started right after Season 5 finished and coded for almost 3 months straight, sometimes as much as 4–6 hours a day. That left little sleep considering I had (and still have) a full time job as well. But when a season is running, my attention goes mostly to moderating the chat and making sure the hardware runs as it should. So anything from 0–4 hours per day during a season.

Are there any major engines that did not participate over the past few seasons? If so, do you know why they declined?

I pick the engines myself, but there was the case of HIARCS for Season 6, where the programmer Mark Uniacke told me to withdraw it. I only did it because I did not buy his program—he sent it to me for free for Season 5. But if I had bought it myself, I would have included it. Other than HIARCS there have not really been any similar cases in TCEC history. Now and then the question of why Fritz does not participate pops up, but that has a simple answer: It does not come in a form that supports UCI or Xboard—it has a native protocol built into the Fritz GUI which makes it unusable.

If I understand correctly, your goal is to include every major engine, and the only reasons they could be left out are (a) their authors explicitly withdraw them, or (b) they aren’t compatible with the required protocols. Do I have that right? And that HIARCS and Fritz are the only major engines not participating?

Yes, every major engine that is not a direct clone. The whole clone debate is a hot topic in most computer chess forums. So your (a) and (b) are both correct. HIARCS was not a part of Season 7 for the same reason as it was not a part of Season 6.

Has there been any recent criticism of the TCEC from chess engine developers that were not included (Fritz), or sat out (HIARCS), or others?

No, there has not.

How strong a chess player are you? Do you play in tournaments, a club, or online?

I am not very strong. I don’t even have a rating. I would estimate my strength at around 1500 FIDE on a good day.

Can you tell me a bit about yourself?

I am 33 years old and living with my dog. (For now!) I am currently working as an IT consultant and for the past 1.5 years I’ve worked for Microsoft as part of their international Bing search engine MEPM team. I have no formal education apart from what would equal high school in the U.S. Everything I’ve done so far is self-taught.

How many other people help regularly in organizing and running the TCEC? Are they all volunteers?

Nelson Hernandez is in charge of the openings, assisted by Adam Hair and international master Erik Kislik. Jeremy Bernstein has helped me with the cutechess-cli customization. Paolo Casaschi (author of pgn4web) has also helped me with some specific inquiries I’ve had about JavaScript code. They are all volunteers. ☺

How did the idea for the TCEC come to you?

Basically it started after I left the computer chess ranking list (CCRL) after a couple of years of being a member. I was tired of just running computer chess engine games for statistics—I wanted to slow down the time control and watch the games. Obviously, the idea of a live broadcast wasn’t new, and in the beginning it was very simple, just a plain website with moves and not much else. It has now evolved with a more advanced website that I think is kind of intuitive and nice to use and gives TCEC a kind of unique platform.

Why is there so little time between TCEC seasons? Why not one season per year, more like the human world championship? Do the engines change enough between seasons for such frequent seasons to be meaningful?

The rhythm the past few years has been roughly two seasons per year. One season takes 3–4 months, so basically you can watch TCEC for half a year per year. It is definitely debatable whether this is useful or meaningful, but that’s just how it has been. Of course, this might change in the future. I have no other good answer. ☺

What are your plans for the future of TCEC, short-term and long-term?

Short-term would be to take a (well deserved) break. ☺ Long-term would be to be recognized by some big company to “get the ball rolling.”

Are you planning any changes in the format or rules for Season 8?

There might be changes for Season 8. Nothing is decided yet.

Regarding rules, while following the Superfinal games I noticed that some games were declared drawn by the rules when there seemed to be a lot of life left in the position—for example, the final position of game 18, which human grandmasters might play on for either side. Do you think this rule might be revised?

I don’t think the TCEC Draw Rule or TCEC Win Rule will be changed. They have been there from the start (slightly modified since the beginning) and no one is really complaining. As for the particular example with game 18, both engines are 100% certain that this is a draw (both show 0.00) so even if we humans think it looks chaotic, the engines simply have it all calculated way in advance.

I noticed that endgame tablebases were not used in the Superfinal, and this must have resulted in some incorrect evaluations. For example, as I was watching one game, I saw that one engine’s principal variation ended in a KRB-vs-KNN position, which is a general win for the stronger side, but the evaluation was not close to indicating a forced win. Do you think that could have helped cause more draws to happen?

That is correct, tablebases were disabled for all engines for the whole of Season 7. Previously they had been available, but some fans wanted them disabled so I figured they would have their wish fulfilled for Season 7. What tablebases do is to basically help the engines find the correct way into a winning endgame—or in worst case scenario, prevent a loss. It shouldn’t affect the draw rate overall since it would even out in the end. But the point is that without tablebases, the engines can only rely on their own strength in the endgame and the path for getting there.

Have you thought of inviting strong players to comment on the games live, as happens in the top human-versus-human tournaments and matches? Is it too expensive?

We’ve had some discussions, but nothing concrete yet. It could probably be something to do for the Superfinal if the required money could be arranged.

Have you approached any major companies like Intel, AMD, or Microsoft about sponsoring the event or making it much bigger in scope/publicity?

Not in a while. Back when I did, I got no reply or acknowledgment whatsoever.

Do you have data on how many people in total looked at the latest Superfinal on tcec.chessdom.com, and any other rough numbers on chat commenters, etc.?

There were approximately 26,000 unique visitors there during the Superfinal. From memory, the number of users in the chat peaked at roughly 600 at one point during the match.

Do you think that the chess world should pay more attention to TCEC in particular, and to engine-versus-engine games in general? They are rarely quoted in discussions of opening theory, or of the best games, best moves, or most interesting positions. Do you have an opinion about why this is?

I think they should. There are so many beautiful games coming out of TCEC that can blow one’s mind. Why we see little reference to engine-versus-engine games is hard to say, but my guess is that it is related to the fact that a chess engine is basically an A.I., so people might have a hard time admitting that “a robot” can play even more beautiful chess than humans.

What intrigues me most about TCEC may be the fact that it is a very personal project for you, yet it has attained a measure of worldwide respect and fame without having a big sponsor or lots of money involved.

This project is of course very personal. Anton Mihailov of chessdom.com contacted me prior to Season 5 and we have continued our cooperation since. To have a hobby being acknowledged like that is of course very nice. With that said, if Intel or AMD or any other big company would be interested in sponsoring TCEC I would definitely be interested in having a talk with them too. Bottom line is: Most people regard TCEC as the official “world computer chess championship.” And I don’t think they are wrong about that! ☺

My thanks to Martin Thoresen, grandmaster Larry Kaufman (of the Komodo team), international master Erik Kislik (who made the final selection of openings for the match), and everyone else who answered my questions for these pieces. I am looking forward to Season 8 of TCEC!

More on "Why Our Memory Fails Us"

2014-12-02T06:00:00.000-08:00

Today the New York Times published an op-ed by Daniel Simons and me, under the title "Why Our Memory Fails Us." In the article, we use the recent discovery that Neil deGrasse Tyson was making incorrect statements about George W. Bush based on false memories as a way to introduce some ideas from the science of human memory, and to argue that we all need to rethink how we respond to allegations or demonstrations of false memories. "We are all fabulists, and we must all get used to it" is how we concluded.

In brief, Tyson told several audiences that President Bush said the words "Our God is the God who named the stars" in his post-9/11 speech in order to divide Americans from Muslims. Sean Davis, a writer for the website The Federalist, pointed out that Bush never said these exact words, and that the closest words he actually said were spoken after the space shuttle explosion in 2003 as part of a tribute to the astronauts who died. Davis drew a different conclusion than we did—namely that the misquotes show Tyson to be a serial fabricator—but he brought Tyson's errors to light in a series of posts at The Federalist, and he deserves credit for noticing the errors and inducing Tyson to address them.

Tyson first responded, in a Facebook note, by claiming that he really did hear Bush say those words in the 9/11 context, but he eventually admitted that this memory had to be incorrect.

All this happened in September. After reading Tyson's response, I wondered why it didn't include a simple apology to President Bush for implying that he was inciting religious division. On a whim I tweeted that Tyson should just apologize and put the matter behind him:

.@neiltyson Why not say "My apologies to George W. Bush for repeatedly & completely misstating what he meant in that quote about 'our God.'"
— Christopher Chabris (@cfchabris) September 28, 2014

I had never met or communicated with Neil deGrasse Tyson, and I doubt he had any idea who I was, so it was somewhat to my surprise that he replied almost immediately:

@cfchabris Thanks. Sure, I plan to say something like that soon. I’m looking for a good medium & occasion.
— Neil deGrasse Tyson (@neiltyson) September 28, 2014

A few days later, Tyson issued his apology as part of another Facebook note entitled "A Partial Anatomy of My Public Talks." Hopefully it is clear that we wrote our piece not to pick apart Tyson's errors or pile on him, but to present the affair as an example of how we can all make embarrassing mistakes based on distorted memories, and therefore why our first reaction to a case of false memory should be charitable rather than cynical. Not all mistaken claims about our past are innocent false memories, of course, but innocent mistakes of memory should be understood as the norm rather than the exception.

The final version of the op-ed that we submitted to the New York Times was over 1900 words long; after editing, the published version is about 1700 words. Several pieces of information, including the names of Davis and The Federalist—who did a service by bringing the matter to light—were casualties of the condensation process. (A credit to ourselves for the research finding that most people believe memory works like a video camera was also omitted.) We tried to leave it clear that we deserve no credit for discovering Tyson's misquote. In our version there were also many links that were omitted from the final online version. In particular, we had included links to Davis's original Federalist article, Tyson's first reply, and Tyson's apology note, as well as several of the research articles we mentioned.

For the record, below is a list of all the links we wanted to include. Obviously there are others we could have added, but these cover what we thought were the most important points relevant to our argument about how memory works. For reasons of their own, newspapers like the Times typically allow few links to be included in online stories, and prefer links to their own content. Even our twelve turned out to be too many.

Neil deGrasse Tyson's 2008 misquotation of George W. Bush (video)

Bush's actual speech to Congress after 9/11 (transcript)

Bush's 2003 speech after the space shuttle explosion (transcript)

Sean Davis's article at The Federalist

Tyson's initial response on Facebook

Tyson's subsequent apology on Facebook

National Academy of Sciences 2014 report on eyewitness testimony

Information on false convictions based on eyewitness misidentifications from The Innocence Project (an organization to which everyone should consider donating)

Roediger and DeSoto article on confidence and accuracy in memory

Simons and Chabris article on what people believe about how memory works

Registered replication report on the verbal overshadowing effect

Daniel Greenberg's article on George W. Bush's false memory of 9/11

GAME ON — My New Column in the Wall Street Journal

2014-11-10T15:13:00.000-08:00

For a while I’ve had the secret ambition to write a regular newspaper column. At one time I thought I could write a chess column; at other times I thought that Dan Simons and I could write a series of essays on social science and critical thinking. Last year I suggested to the Wall Street Journal a column on games. They turned me down then, but a few weeks ago I gently raised the idea again and the editors kindly said they would give it a try. So I’m excited to say that the first one is out in this past weekend’s paper (page C4, in the Review section), and also online here.

The column is about Dan Harrington's famous "squeeze play" during the final table of the 2004 World Series of Poker main event. Here's how ESPN covered the hand (you can see in the preview frame that he was making a big bluff with his six-deuce):

There were several things about this hand that I would have mentioned if I had the space. First, a couple of important details for understanding the action:

Greg Raymer, the ultimate winner, started the hand with about $7.9 million in chips. Josh Arieh, who finished third, had $3.9 million. Harrington had $2.3 million, the second smallest stack at the table.

Seven players remained in the tournament (of the starting field of about 2500) when this hand was played. At a final table like this, the prize payouts escalate substantially with each player eliminated. This might explain why Harrington put in half of his chips, rather than all of them. In case he got raised or called and lost the hand, he would still have a bit left to play with, and could hope to move up the payout chart if other players busted before he did.

David Williams, the eventual runner-up, was actually dealt the best hand of anyone: he had ace-queen in the big blind. But facing a raise, a call, and a re-raise in front of him, he chose to fold, quite reasonably assuming that at least one of the players already in the hand would have had him beat, and perhaps badly—e.g., holding ace-king. For reasons of space and simplicity I had to omit Williams from the account in the article. I also omitted the suits of the cards.

Dan Harrington is a fascinating character. He excels at chess, backgammon, and finance as well as poker, and he wrote a very popular series of books on hold'em poker with Bill Robertie (himself a chess master and two-time world backgammon champion). He won the World Series of Poker main event in 1995. After his successful squeeze play in 2004 he wound up finishing fourth. He had finished third the year before.

Some people have noted that this could not have been the very first squeeze play bluff ever in poker. And of course it wasn't. But it was, in my opinion, the most influential squeeze play. Because ESPN revealed the players' hole cards, it was verifiably a squeeze play. As I hinted in the article, without the hole card cameras, everyone watching the hand would have assumed that Harrington had a big hand when he raised. Even if Harrington had said later that he had a six-deuce, some people wouldn't have believed him, and no one could have been sure. Once ESPN showed this hand (and Harrington wrote about it in his second Harrington on Hold'em volume), every serious player became aware specifically of the squeeze play strategy, and generally of the value of re-raising "light" before the flop. And because the solid, thinking man's player Dan Harrington did it, they knew it wasn't just the move of a wild man like Stu Ungar, but a key part of a correct, balanced strategy.

Of course, the squeeze play doesn't work every time. It would have failed here if Arieh (or Raymer) really did have big hands themselves. Harrington probably had a "read" that suggested they weren't that strong, but I think this read would have been based much more on his feel for the overall flow of the game—noticing how many pots they were playing, whether they had shown down weak hands before—than on any kind of physical or verbal tell.

Two years after Harrington's squeeze play, Vanessa Selbst was a bit embarrassed on ESPN when she tried to re-squeeze a squeezer she figured she had caught in the act. At a $2000 WSOP preliminary event final table, she open-raised with five-deuce, and drew a call followed by a raise: the exact pattern of the squeeze play. After some thought she went all-in, but the putative squeezer held pocket aces. Selbst was out in 7th place. But she didn't stop playing aggressively, and since then she has become one of the top all-time money winners, and most respected players in poker. Most of the hand is shown in the video below, starting at about the 6:50 mark.

Some readers of the column asked whether I wasn't just describing a plain old bluff, the defining play of poker (at least in the popular mind). The answer is that the squeeze play is a particular kind of bluff—indeed, a kind of "pure bluff," which is a bluff in which your own hand has zero or close to zero chance of actually winning on its merits. (A "semi-bluff," by contrast, is a bluff when you figure to have the worst hand at the time of the bluff, but your hand has a good chance of improving to be the best hand by the time all the cards are out.) What the Harrington hand showed is a particular situation in which a pure bluff is especially likely to work. Pros don't bluff randomly, or when they feel like it, or even when they think they have picked up a physical tell. And they especially don't bluff casually when more than one opponent has already entered the hand. Harrington's bluff was more than just a bluff: It was a demonstration of how elite players exploit their skills to pick just the right spots to bluff and get away with it.

In future columns I’ll talk about different games, hopefully with something interesting to say about each one. The next column should appear in the December 6–7 issue, and will probably concern the world chess championship. My “tryout” could end at any time, of course, but for now my column should be in that same space once per month, at a length of about 450 words. As most of you know, it’s a challenge to say something meaningful in so few words, and for me it’s a challenge just to stay within that word limit while saying anything at all. As in poker, I may need a bit of luck.

By the way, I think it's too bad that the New York Times decided to end their chess column last month. I believe, or at least hope, that there is a market for regular information on games like chess for people who don't pay so much attention via other websites and publications. I remember reading Robert Byrne's version of the Times column in the 1970s. I would get especially excited when my father came home on one of the days the column ran, so that I could grab his newspaper and check it out. Yes, it used to run at least three times per week, then two, then just on Sundays (when Dylan McClain took over with a different, and I think better, approach from Byrne's). Now it doesn't run at all. The Washington Post ended its column as well, but some major newspapers still have one (The Boston Globe and New York Post come to mind).

PS: If you liked the squeeze play column, here are some of my other pieces on games that you can read online, in reverse chronological order:

"The Science of Winning Poker" (WSJ, July 2013)

"Should Poker Be (A Tiny Bit) More Like Chess?" (this blog, August 2013)

"Chess Championship Results Show Powerful Role of Computers" (WSJ, November 2013)

"Bobby Fischer Recalled" (WSJ, March 2009)

"It's Your Move" (WSJ, October 2007)

"How Chess Became the King of Games" (WSJ, November 2006)

"The Other American Game" (WSJ, July 2005)

"A Match for All Seasons" (WSJ, December 2002)

"Checkmate for a Champion" (WSJ, November 2000)

"Data Journalism" on College ROI at FiveThirtyEight: Where's the Critical Thinking?

2014-03-28T18:33:00.000-07:00

NOTE: See the end of this entry for important updates, including one from 11/9/15.

A website called PayScale recently published a "College ROI Report" that purports to calculate the return on investment (ROI) of earning a Bachelor's degree from each of about 900 American colleges and universities. I found out about this report from an article on Nate Silver's new FiveThirtyEight website. The article appears under a banner called "DataLab," implying that it is an example of the new "data journalism" that Silver and his site are all about. Unfortunately, the article contains approximately zero critical thinking about the meaning of the PayScale report, its data sources, and its conclusions.

PayScale did a lot of number-crunching (read all about it here), but the computation resulted in two key numbers for each institution: (1) the cost of getting an undergraduate degree, taking into account factors like financial aid and time to graduation; and (2) the expected total earnings of a graduate over the next twenty years. The first one can be figured out from public data sources. The second one came from a survey by PayScale (more on this later). The ROI for a college was calculated by subtracting #1 from #2, and then further subtracting the expected total earnings of a person who skipped college and worked for 24–26 years instead (which happens to be about $1.1 million). The table produced by PayScale thus purports to show how much you would get back—in monetary income—on the "investment" of obtaining a degree from any particular college or university.
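To make the arithmetic in the previous paragraph concrete, here is a minimal sketch of the subtraction as I have described it. The function name and the example inputs are my own hypothetical choices, not PayScale's code or data; only the roughly $1.1 million no-college baseline comes from the report as summarized above.

```python
def payscale_style_roi(earnings_with_degree, cost_of_degree, earnings_without_degree):
    """Return the 'ROI' as described in the text: graduate earnings,
    minus the cost of the degree, minus what a non-graduate would have earned."""
    return earnings_with_degree - cost_of_degree - earnings_without_degree


# Hypothetical inputs for one imaginary college (NOT PayScale figures),
# except the ~$1.1 million baseline for skipping college mentioned above.
example_roi = payscale_style_roi(
    earnings_with_degree=1_900_000,   # assumed 20-year earnings of a graduate
    cost_of_degree=200_000,           # assumed net cost of the degree
    earnings_without_degree=1_100_000,
)
print(example_roi)  # 600000
```

Note that nothing in this calculation knows anything about who the graduates are, which is exactly the problem discussed next.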

Indeed, PayScale says that "This measure is useful for high school seniors evaluating their likely financial return from attending and graduating college." But this is simply not true. As I read the FiveThirtyEight article on the PayScale report, I was waiting for them to point out the reasons why, but they never did. The only critical comments were about incorporating the effects of student debt.

What are the problems with the PayScale analysis? First of all, it only makes sense to speak of the comparative return on an investment when the investors have a choice of what to invest in. If every person could choose to attend any college (and to graduate from it and get a full-time job), or to skip college entirely, then it would be meaningful to ask which choice maximizes return. This is what we do when calculating a financial ROI: we try to figure out whether investing in stocks versus bonds, or one mutual fund versus another, or one business opportunity versus another, will be more profitable. But colleges have admissions requirements, so not everyone can go to whatever college he or she wants. Colleges select their students as much as students select their colleges. And in fact, the people who attend different colleges can be very different, and they can be even more different from the people who don't attend college at all.

This means that the Return in this "ROI" depends on much more than the Investment. It also depends on who is doing the investing. In fact, it is far from trivial to figure out the true ROI of going to Harvard versus Vanderbilt versus Wayland Baptist versus Nicholls State versus not attending college at all. To figure this out, you would have to control in the analysis for all the characteristics that make students at different colleges different from one another, and different from students who don't go to college. Factors like cognitive ability, ambition, work habits, parental income and education, where the students grew up and went to high school, what grades they got, and many others are likely to be important. In fact, those other factors could be so important that they might wind up explaining more of the variation in income between people than is explained by going to college—let alone which particular college people go to.

Even controlling for data we might be able to obtain, like the average SAT score and parental income of students who attend each college, would not completely solve the problem, because there could be factors that we can't measure that have an important effect. Only by randomly assigning students to different colleges (or to directly entering the workforce after high school) would we get an estimate of the true ROI (measured in money—which of course leaves aside all the other benefits one might get from college that don't show up in your next twenty years of paychecks).

Of course this ideal experiment won't ever happen, but clever researchers have tried to approximate it by doing things like looking at students who were accepted to both a higher-ranked and a lower-ranked school, and then comparing those who enroll in the higher-ranked one to those who enroll in the lower-ranked one. Since all the students in this analysis got into both schools, the problem of different schools having different students is mitigated. (Not erased entirely, though: for example, people who deliberately attend lower-ranked schools might be doing so because of financial circumstances, or their college experience may differ because they are likely to start out above average in ability and preparation for the school they attend, as compared to those who choose higher-ranked schools.)

FiveThirtyEight said nothing about this fundamental logical problem with the entire PayScale exercise. Nor did it address the other flaws in the analysis and presentation of the data.

It could have also asked about the confidence intervals around the ROI estimates provided by PayScale. When you give only point estimates (exact values that represent just the mean or median of a distribution), and proceed to rank them, you create the appearance of a world where every distinction matters—that the school ranked #1 really has a higher ROI than #2, which is higher than #3, and so on. PayScale's methodology page says, "the 90% confidence interval on the 20 year median pay is ±5%" (but 10% for "elite schools" and "small liberal arts schools or schools where a majority of undergraduates complete a graduate degree"). The narrowness of these intervals is a bit hard to believe, as is their uniformity (how does every school in a category get the same confidence interval?). Why not just put the school-specific confidence intervals into the report, so that it is obvious that, for example, school #48 (Yale) is probably not significantly higher in ROI than, say, school #69 (Lehigh), but is probably lower in ROI than school #6 (Georgia Tech)?

It's hard to have much confidence in these confidence intervals anyhow, since we don't know how many people PayScale surveys at each college to make the income calculations (which will be the critical drivers of the variability in ROI). Many of the colleges are small; how reliable can the estimates of what their graduates will earn be? And are the surveys of college graduates unbiased with respect to what field the graduates work in? Or, for example, do engineers and teachers tend to respond to these surveys more than, say, baristas and consultants? The unemployed and under-employed are not included; this will have the effect of inflating the apparent ROI of schools whose graduates tend, for whatever reasons, not to have full-time jobs. PayScale says that non-cash compensation and investment income are not included, which might bias down the reported ROI of graduates of elite schools who go into financial careers.

Finally, perhaps FiveThirtyEight could have looked at whether the schools that stand out at either end of the distribution happen to be smaller than the ones in the middle. Ohio State, Florida State, et al. have so many students, drawn from such a broad distribution of ability and other personal traits, that they should be expected to have "ROI" values nearer to the middle of the overall distribution of universities than should small colleges, which through pure chance (having, by luck, more high- or low-income graduates) are more likely to land in the top or bottom thirds of the list. Some degree of mean reversion may be expected, so the rankings of PayScale will lose some predictive value for future ROIs, especially in the case of small schools.
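The point about small schools can be illustrated with a quick simulation. This is only a sketch under assumptions of my own: every simulated graduate's income is drawn from the same lognormal distribution (so every school has the same "true" value), schools differ only in how many graduates are sampled, and "extreme" means the top or bottom 10% of the ranking. The school counts, sample sizes, and distribution parameters are all hypothetical.

```python
import random
import statistics


def share_of_small_schools_in_extremes(
    n_small=100, n_large=100, small_size=30, large_size=3000,
    tail_fraction=0.10, seed=0,
):
    """Simulate schools whose graduates all come from ONE income distribution,
    rank schools by median income, and return the fraction of schools in the
    top and bottom tails of the ranking that are small."""
    rng = random.Random(seed)
    schools = []
    for kind, count, size in (("small", n_small, small_size),
                              ("large", n_large, large_size)):
        for _ in range(count):
            # Same distribution for everyone: any ranking differences are pure chance.
            incomes = [rng.lognormvariate(11.0, 0.6) for _ in range(size)]
            schools.append((statistics.median(incomes), kind))

    schools.sort()  # rank by median income, lowest first
    tail = int(len(schools) * tail_fraction)
    extremes = schools[:tail] + schools[-tail:]
    return sum(1 for _, kind in extremes if kind == "small") / len(extremes)


share = share_of_small_schools_in_extremes()
print(f"Share of extreme-ranked schools that are small: {share:.0%}")
```

Even though no simulated school is actually better than any other, the small ones dominate both ends of the list, simply because a median based on a few dozen graduates bounces around far more than one based on thousands.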

The comments I have made all concern the underlying PayScale report, but I think it is FiveThirtyEight that has not upheld the best standards of "data journalism." If that term is to have any meaning, it can't simply refer to "journalism" that consists of the passing along of other people's flawed "data" (especially when those people are producing and promoting the data for commercial purposes). Nate Silver earned his reputation, and that of his FiveThirtyEight brand, largely by calling out—and improving on—just this kind of simplistic and misleading analysis. It's sad to see his "data journalism organization" no longer criticizing superficiality, but instead promoting it.

Postscripts:

3/29/14: After I first posted this piece, I realized three things. First, I hadn't mentioned mean reversion originally, so I added it in. But it's a minor issue compared to the others. Second, I didn't make it clear that notwithstanding what I wrote above, I am 100% in favor of more good data journalism. I agree with Nate Silver and others that journalists (and everyone!) should be more aware of the data that exists to answer questions, how to gather data that has not already been compiled, how to think about data, and so on. A great example of silly data-ignorant journalism is the series of articles the New York Post has been running on the "epidemic" of suicides and suspicious deaths in the financial industry. The proper question to start with is whether there is an epidemic, or even a significant excess over normal variation, as opposed to a set of coincidences that would be expected to happen every so often. Perhaps there is an epidemic, but I am skeptical. The Post (and other outlets that have reported on these deaths) skips right over this crucial threshold issue. Maybe FiveThirtyEight could address it and teach its readers about the danger of jumping to conclusions after seeing nonexistent patterns in noise. Third, and finally, I should have mentioned that FiveThirtyEight has on board some people who really do know how to think seriously about data (and do it much better than I do), such as the economist Emily Oster. I hope Emily's influence will spread throughout the organization.

3/30/14: I removed text in the original version that asked whether outliers like hedge fund managers had their incomes included in PayScale's calculations. They won't have too much influence, regardless, because PayScale is reporting medians, not means. My apologies for the inadvertent error.

4/5/14: I changed the number of colleges included from 1310 to "about 900." There are 1310 entries in PayScale's table, but many colleges are listed more than once if they have different tuition options (e.g. state resident versus non-resident).

4/7/14: I added links to the Krueger & Dale (and Dale & Krueger) economics papers that tried to estimate the returns from attending more selective/elite colleges. I knew about these papers when I wrote the initial post, but had forgotten who the authors were.

Addendum, 11/9/15: In an article at washingtonpost.com, Nate Silver is quoted as saying the following when comparing his FiveThirtyEight site to Vox, one of his main competitors:

I think the best five or ten things they do are terrific, right? They have some great people working for them. I think they also have a lot of less than terrific things … I know how hard my writers and my editors work to try and get the facts right, to not always go for the hot take that you can’t really provide evidence for, right? To avoid errors and mistakes. And so, you know, I obviously have some skin in the game where I feel like if people are taking a lot of shortcuts and things that have the sheen of being data driven and maybe aren’t very empirical and aren’t very self aware, then, yeah, I guess I get really annoyed.

I think "taking a lot of shortcuts and things that have the sheen of being data driven and aren't very empirical and aren't very self aware" is an excellent description of the FiveThirtyEight piece on PayScale's completely misleading ROI analysis. And the piece remains on the site, as far as I can tell just as it was when I wrote this entry, with no corrections or updates or qualifications of its superficial and non-self-aware reporting. But at least it wasn't published on Vox!

Why Malcolm Gladwell Matters (And Why That's Unfortunate)

2013-10-04T17:56:00.001-07:00

Malcolm Gladwell, the New Yorker writer and perennial bestselling author, has a new book out. It's called David and Goliath: Misfits, Underdogs, and the Art of Battling Giants. I reviewed it (PDF) in last weekend's edition of The Wall Street Journal. (Other reviews have appeared in The Atlantic, The New York Times, The Guardian, and The Millions, to name a few.) Even though the WSJ editors kindly gave me about 2500 words to go into depth about the book, there were many things I did not have space to discuss or elaborate on. This post contains some additional thoughts about Malcolm Gladwell, David and Goliath, the general modus operandi of his writing, and how he and others conceive of what he is doing.

I noticed some interesting reactions to my review. Some people said I was a jealous hater. One even implied that as a cognitive scientist (rather than a neuroscientist) I somehow lacked the capacity or credibility to criticize anyone's logic or adherence to evidence. A more serious response, of which I saw several instances, came from people who said in essence "Why do you take Gladwell so seriously—it's obvious he is just an entertainer." For example, here's Jason Kottke:

I enjoy Gladwell's writing and am able to take it with the proper portion of salt ... I read (and write about) most pop science as science fiction: good for thinking about things in novel ways but not so great for basing your cancer treatment on.

The Freakonomics blog reviewer said much the same thing:

... critics have primarily focused on whether the argument they think Gladwell is making is valid. I am going to argue that this approach misses the fact that the stories Gladwell tells are simply well worth reading.

I say good for you to everyone who doesn't take Gladwell seriously. But the reason I take him seriously is because I take him and his publisher at their word. On their face, many of the assertions and conclusions in Gladwell's books are clearly meant to describe lawful regularities about the way human mental life and the human social world work. And this has always been the case with his writing.

In The Tipping Point (2000), Gladwell wrote of sociological regularities and even coined new ones, like "The Law of the Few." Calling patterns of behavior "laws" is a basic way of signaling that they are robust empirical regularities. Laws of human behavior aren't as mathematically precise as laws of physics, but asserting one is about the strongest claim that can be made in social science. To say something is a law is to say that it applies with (near) universality and can be used to predict, in advance, with a fair degree of certainty, what will happen in a situation. It says this is truth you can believe in, and act on to your benefit.

A blurb from the publisher of David and Goliath avers: "The author of Outliers explores the hidden rules governing relationships between the mighty and the weak, upending prevailing wisdom as he goes." A hidden rule is a counterintuitive, causal mechanism behind the workings of the world. If you say you are exploring hidden rules that govern relationships, you are promising to explicate social science. But we don't have to take the publisher's word for it. Here's the author himself, in the book, stating one of his theses:

The fact of being an underdog changes people in ways that we often fail to appreciate. It opens doors, and creates opportunities and educates and permits things that might otherwise have seemed unthinkable.

The emphasis on changes is in the original (at least in the version of the quote I saw on Gladwell's Facebook page). In an excerpt published in The Guardian, he wrote, "If you take away the gift of reading, you create the gift of listening." I added the emphasis on create to highlight the fact that Gladwell is here claiming a causal rule about the mind and brain, namely that having dyslexia causes one to become a better listener (something he says made superlawyer David Boies so successful).

I've gone on at length with these examples because I think they also run counter to another point I have seen made about Gladwell's writings recently: That he does nothing more than restate the obvious or banal. I couldn't disagree more here. Indeed, to his credit, what he writes about is the opposite of trivial. If Gladwell is right in his claims, we have all been acting unethically by watching professional football, and the sport will go the way of dogfighting, or at best boxing. If he is right about basketball, thousands of teams have been employing bad strategies for no good reason. If he is right about dyslexia, the world would literally be a worse place if everyone were able to learn how to read with ease, because we would lose the geniuses that dyslexia (and other "desirable difficulties") create. If he was right about how beliefs and fads spread through social networks in The Tipping Point, consumer marketing would have changed greatly in the years since. Actually, it did: firms spent great effort trying to find "influentials" and buy their influence, even though there was never good causal evidence that this would work. (See Duncan Watts's brilliant book Everything is Obvious, Once You Know the Answer—reviewed here—to understand why.) If Gladwell is right, also in The Tipping Point, about how much news anchors can influence our votes by deploying their smiles for and against their preferred candidates, then democracy as we know it is a charade (and not for the reasons usually given, but for the completely unsupported reason that subliminal persuaders can create any electoral results they want). And so on. These ideas are far from obvious, self-evident, or trivial. They do have the property of engaging a hindsight bias, of triggering a pleasurable rush of counterintuition, of seeming correct once you have learned about them. But an idea that people feel like they already knew is much different from an idea people really did know all along.

Janet Maslin's New York Times review of David and Goliath begins by succinctly stating the value proposition that Gladwell's work offers to his readers:

The world becomes less complicated with a Malcolm Gladwell book in hand. Mr. Gladwell raises questions — should David have won his fight with Goliath? — that are reassuringly clear even before they are answered. His answers are just tricky enough to suggest that the reader has learned something, regardless of whether that’s true.

(I would only add that the world becomes not just less complicated but better, which leaves the reader a little bit happier about life.) In a recent interview with The Guardian, Gladwell as much as agreed: "If my books appear to a reader to be oversimplified, then you shouldn't read them: you're not the audience!"

I don't think the main flaw is oversimplification (though that is a problem: Einstein was right when he—supposedly—advised that things be made as simple as possible, but no simpler). As I wrote in my own review, the main flaw is a lack of logic and proper evidence in the argumentation. But consider what Gladwell's quote means. He is saying that if you understand his topics enough to see what he is doing wrong, then you are not the reader he wants. At a stroke he has said that anyone equipped to properly review his work should not be reading it. How convenient! Those who are left are only those who do not think the material is oversimplified.

Who are those people? They are the readers who will take Gladwell's laws, rules, and causal theories seriously; they will tweet them to the world, preach them to their underlings and colleagues, write them up in their own books and articles (David Brooks relied on Gladwell's claims more than once in his last book), and let them infiltrate their own decision-making processes. These are the people who will learn to trust their guts (Blink), search out and lavish attention and money on fictitious "influencers" (The Tipping Point), celebrate neurological problems rather than treat them (David and Goliath), and fail to pay attention to talent and potential because they think personal triumph results just from luck and hard work (Outliers). It doesn't matter if these are misreadings or imprecise readings of what Gladwell is saying in these books—they are common readings, and I think they are more common among exactly those readers Gladwell says are his audience.

Not backing down, Gladwell said on the Brian Lehrer show that he really doesn't care about logic, evidence, and truth, or at least that he thinks discussions of the concerns of "academic research" in the sciences (i.e., logic, evidence, and truth) are "inaccessible" to his lowly readers:

I am a story-teller, and I look to academic research … for ways of augmenting story-telling. The reason I don’t do things their way is because their way has a cost: it makes their writing inaccessible. If you are someone who has as their goal ... to reach a lay audience ... you can't do it their way.

In this and another quote, from his interview in The Telegraph, about what readers "are indifferent to," the condescension and arrogance are in full view:

And as I’ve written more books I’ve realised there are certain things that writers and critics prize, and readers don’t. So we’re obsessed with things like coherence, consistency, neatness of argument. Readers are indifferent to those things.

Note, incidentally, that he mentions coherence, consistency, and neatness. But not correctness, or proper evidence. Perhaps he thinks that these are highfalutin cares for writers and critics, or perhaps he is some kind of postmodernist for whom they don't even exist in any cognizable form. In any case, I do not agree with Gladwell's implication that accuracy and logic are incompatible with entertainment. If anyone could make accurate and logical discussion of science entertaining, it is Malcolm Gladwell.

Perhaps ... perhaps I am the one who is naive, but I was honestly very surprised by these quotes. I had thought Gladwell was inadvertently misunderstanding the science he was writing about, and making sincere mistakes in the service of coming up with ever more "Gladwellian" insights to serve his audience. But according to his own account, he knows exactly what he is doing, and not only that, he thinks it is the right thing to do. Is there no sense of ethics that requires more fidelity to truth, especially when your audience is so vast—and, by your own admission, so benighted—as to need oversimplification and to be unmoved by little things like consistency and coherence? I think a higher ethic of communication should apply here, not a lower standard.

This brings me back to the question of why Gladwell matters so much. Why am I, an academic who is supposed to be keeping his head down and toiling away on inaccessible stuff, spending so much time on reading his interviews, reviewing his book, and writing this blog post? What Malcolm Gladwell says matters because, whether academics like it or not, he is incredibly influential.

As Gladwell himself might put it: "We tend to think that people who write popular books don't have much influence. But we are wrong." Sure, Gladwell has huge sales figures and is said to command big speaking fees, and his TED talks are among the most watched. But James Patterson has huge sales too, and he isn't driving public opinion or belief. I know Gladwell has influence for multiple reasons. One is that even highly-educated people in leadership positions in academia—a field where I have experience—are sometimes more familiar with and more likely to cite Gladwell's writings than those of the top scholars in their own fields, even when those top scholars have put their ideas into trade-book form like Gladwell does.

Another data point: David and Goliath has only been out for a few days, but already there's an article online about its "business lessons." A sample assertion:

Gladwell proves that not only do many successful people have dyslexia, but that they have become successful in large part because of having to deal with their difficulty. Those diagnosed with dyslexia are forced to explore other activities and learn new skills that they may have otherwise pursued.

Of course this is nonsense—there is no "proof" of anything in this book, much less a proof that dyslexia causes success. I wonder if the author of this article even has an idea what proper evidence in support of these assertions would be, or if he knows that these kinds of assertions cannot be "proved."

One final indicator of Malcolm Gladwell's influence—and I'll be upfront and say this is an utterly non-scientific and imprecise methodology—that suggests why he matters. I Googled the phrases "Malcolm Gladwell proved" and "Malcolm Gladwell showed" and compared the results to the similar "Steven Pinker proved" and "Steven Pinker showed" (adding in the results of redoing the Pinker search with the incorrect "Stephen"). I chose Steven Pinker not because he is an academic, but because he has published a lot of bestselling books and widely-read essays and is considered a leading public intellectual, like Gladwell. Pinker is surely much more influential than most other academics. It just so happens that he published a critical review of Gladwell's previous book—but this also is an indicator of the fact that Pinker chooses to engage the public rather than just his professional colleagues. The results, in total number of hits:

Gladwell: proved 5300, showed 19200 = 24500 total
Pinker: proved 9, showed 625 = 634 total

So the total influence ratio as measured by this crude technique is 24500/634, or over 38-to-1 in favor of Gladwell. I wasn't expecting it to be nearly this high myself. (Interestingly, those "influenced" by Pinker are only 9/634, or 1.4% likely to think he "proved" something as opposed to the arguably more correct "showed" it. Gladwell's influencees are 5300/24500 or 21.6% likely to think their influencer "proved" something.) Refining the searches, adding "according to Gladwell" versus "according to Pinker" and so on will change the numbers, but I doubt enough corrections will significantly redress a 38:1 difference.
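For anyone who wants to check or update the arithmetic, here is a small sketch that recomputes the figures from the hit counts reported above. The variable names are mine, and the counts are just the ones from my searches, so rerunning the searches will give different inputs.

```python
# Hit counts as reported above for the "proved" and "showed" searches.
hits = {
    "Gladwell": {"proved": 5300, "showed": 19200},
    "Pinker": {"proved": 9, "showed": 625},
}

# Total hits per author.
totals = {name: sum(counts.values()) for name, counts in hits.items()}

# Crude "influence ratio": Gladwell's total hits divided by Pinker's.
influence_ratio = totals["Gladwell"] / totals["Pinker"]

# Share of each author's hits that use "proved" rather than "showed".
proved_share = {name: hits[name]["proved"] / totals[name] for name in hits}

print(totals)                      # {'Gladwell': 24500, 'Pinker': 634}
print(round(influence_ratio, 1))   # 38.6
for name, share in proved_share.items():
    print(f"{name}: {share:.1%}")  # Gladwell: 21.6%, Pinker: 1.4%
```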

When someone with this much influence on what people seem to really believe (as indexed by my dashed-off method) says that he is just a storyteller who just uses research to "augment" the stories—who places the stories first and the science in a supporting role, rather than the other way around—he's essentially placing his work in the category of inspirational books like The Secret. As Dan Simons and I noted in a New York Times essay, such books sprinkle in references and allusions to science as a rhetorical strategy. Accessorizing your otherwise inconsistent or incoherent story-based argument with pieces of science is a profitable rhetorical strategy because references to science are crucial touchpoints that help readers maintain their default instinct to believe what they are being told. They help because when readers see "science" they can suppress any skepticism that might be bubbling up in response to the inconsistencies and contradictions.

In his Telegraph interview, Gladwell again played down the seriousness of his own ideas: "The mistake is to think these books are ends in themselves. My books are gateway drugs – they lead you to the hard stuff." And David and Goliath does cite scholarly works, books and journal articles, and journalism, in its footnotes and endnotes. But I wonder how many of its readers will follow those links, as compared to the number who will take its categorical claims at face value. And of those that do follow the links, how many will realize that many of the most important links are missing?

This leads to my last topic, the psychology experiment Gladwell deploys in David and Goliath to explain what he means by "desirable difficulties." The difficulties he talks about are serious challenges, like dyslexia or the death of a parent during one's childhood. But the experiment is a 40-person study on Princeton students who solved three mathematical reasoning problems presented in either a normal typeface or a difficult-to-read typeface. Counterintuitively, the group that read in a difficult typeface scored higher on the reasoning problems than the group that read in a normal typeface.

In my review, I criticized Gladwell for describing this experiment at length without also mentioning that a replication attempt with a much larger and more representative sample of subjects did not find an advantage for difficult typefaces. One of the original study's authors wrote to me to argue that his effect is robust when the test questions are at an appropriate level of difficulty for the participants in the experiment, and that his effect has in fact been replicated “conceptually” by other researchers. However, I cannot find any successful direct replications—repetitions of the experiment that use the same methods and get the same results—and direct replication is the evidence that I believe is most relevant.

This may be an interesting controversy for cognitive psychologists, but it's not the point here. The point is that Gladwell says absolutely nothing about the controversy over whether this effect is reliable. All he does is cite the original 2007 study of 40 subjects and rest his case. Even those who have been hooked by his prose and look to the endnotes of this chapter for a new fix will find no sources for the "hard stuff"—e.g., the true state of the science of "desirable difficulty"—that he claims to be promoting. And if the hard stuff has value, why does Gladwell not wade into it himself and let it inform his writing? When discussing the question of how to pick the right college, why not discuss the intriguing research that debates whether going to an elite school really adds economic value (over going to a lesser-ranked school) for those people who get admitted to both? Or, when discussing dyslexia, instead of claiming it is a gift to society, how about devoting the space to a serious consideration of the hypothesis that this kind of early life difficulty jars the course of development, adding uncertainty (increasing the chances of both success and failure, though probably not in equal proportions) rather than directionality? There was so much more he could have done with these fascinating and important topics.

But at least the difficulty finding a simple experiment to serve as metaphor might have jarred Gladwell into realizing that the connection between the typeface effect, however robust it might turn out to be, and the effect of a neurological condition or loss of a parent, is in fact just metaphorical. There is no relevant nexus between reading faint type and losing a parent at an early age, and pretending there is just loosens the threads of logic to the point of breaking. But perhaps Gladwell already knows this. After all, in his Telegraph interview, he said readers don't care about stuff like consistency and coherence, only critics and writers do.

I can certainly think of one gifted writer with a huge audience who doesn't seem to care that much. I think the effect is the propagation of a lot of wrong beliefs among a vast audience of influential people. And that's unfortunate.

The Part Before the Colon: Is There a Trend Toward Cleverer Journal Article Titles?

2013-10-01T13:37:00.000-07:00

I joined the Society for Personality and Social Psychology last year, even though I am not a social psychologist, because I had to in order to give an invited talk at a pre-conference session of the annual SPSP meeting, which was held in New Orleans. I had a good time, despite having a bad headache during most of my visit. Social psychologists give lots of interesting talks, they tend to be social, and they also dress better than cognitive psychologists and neuroscientists. It was also fun to see which ones made a visit to the casino across the street from the conference hotel.

As an SPSP member, I now receive their flagship journal every month: Personality and Social Psychology Bulletin (PSPB—academics love to refer to journals with acronyms). One of the best parts of the journal, to a non-specialist like me, is the article titles. In psychology, as in many areas of science, there are different strategies for a good title. One is to concisely state the main finding of the paper or the main theoretical claim (occasionally formulated as a question rather than a statement). Another is to precede that kind of title with a clever quip, allusion, pun, or other phrase that grabs attention and orients the (potential) reader towards some aspect of the research you want to emphasize or that makes the work stand out. That is the part before the colon.

An example of this latter strategy is the 1999 article that Dan Simons and I published in Perception. The title was "Gorillas in our midst: Sustained inattentional blindness for dynamic events." (Thanks to M.J. Wraga, a fellow postdoc in the Harvard psychology department at the time, for suggesting the part before the colon.) If you are a real black belt in journal article writing, you can be like Dan Gilbert and combine both a statement of the main finding and a clever quip all into one phrase, as in his wonderful 1993 article (with two co-authors) "You Can't Not Believe Everything You Read." If there were a best title award this would surely be in the running. At least it's one of my favorites.

I think all kinds of titles can be good, if they are done well. There seems to be a trend toward more clever titles, at least during my time in psychology and social science. Consider the latest issue of PSPB (volume 39, number 10). Here are the article titles, just the parts before the colon:

1. "Show Me the Money"

2. Losing One's Cool

3. Changing Me to Keep You

4. Never Let Them See You Cry

5. Gender Bias in Leader Evaluations

6. Getting It On Versus Getting It Over With

7. The Things You Do For Me

8. "I Know Your Pain"

9. How Large Are Actor and Partner Effects of Personality on Relationship Satisfaction?

10. Touch as an Interpersonal Emotion Regulation Process in Couples' Daily Lives

I classify seven out of ten articles (all but #5, #9, and #10) as following the clever title strategy. That seems like a lot more than I used to see. To hastily test this intuition, I looked at the tables of contents for the same journal 10, 20, and 30 volumes ago, using issue 10 in 2003 and the final issue in 1993 and 1983 (since there were fewer than ten issues per volume then). There seems to have been a sharp increase:

2013: 70% (7 out of 10)

2003: 10% (1 out of 10: "The Good, the Bad, and the Healthy")

1993: 0% (0 out of 11)

1983: 17% (2 out of 12: "You Just Can't Count on Things Any More" and "Lonely at the Top")
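As a quick check on the arithmetic, the percentages above can be recomputed from the raw counts. This is only a minimal sketch: the counts are the ones listed in this post, and the names `counts` and `clever_share` are made up for illustration.

```python
# Counts of "clever" pre-colon titles per issue, exactly as tallied above:
# year -> (clever titles, total articles in the issue)
counts = {
    2013: (7, 10),
    2003: (1, 10),
    1993: (0, 11),
    1983: (2, 12),
}


def clever_share(clever, total):
    """Percentage of clever titles, rounded to the nearest whole number."""
    return round(100 * clever / total)


shares = {year: clever_share(c, t) for year, (c, t) in counts.items()}

for year, (clever, total) in counts.items():
    print(f"{year}: {shares[year]}% ({clever} out of {total})")
```

With only ten to twelve articles per issue, a single title classified differently moves a share by roughly eight to ten percentage points, which is one more reason to treat the apparent trend as a rough impression.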

Coincidentally, I received the latest issue of Clinical Psychological Science (volume 1, number 4; TOC apparently not online yet) today as well. It also has ten articles, and none of them have clever parts before the colon in their titles. Maybe clinical psychologists and their subject matter just aren't as funny.

Of course, this is hardly a serious statistical analysis of the phenomenon, and the quippy titles might have just coalesced at random in this particular issue, or this journal might have editors who encourage this kind of title. I should also say that I perceive the trend to exist in other areas besides social psychology. But I have heard it argued that this trend towards cleverer titles—if it really exists!—is a deleterious one, since it puts pressure on authors to come up with clever titles, and makes reviewers and editors and journalists expect to see them, and therefore it may distort the entire research endeavor towards work that can be summed up in not just the proverbial "25 words or less" but in the much higher standard of "10 very clever words or less." I have no strong belief as to whether all this is happening, or in what fields of study, but perhaps it's something to think about.

If someone does the research and writes a journal article on this, they are welcome to use the title "In 25 Words or Less: The Effect of Trends Toward Clever Pre-Colon Article Titles on the Content and Quality of Research." Just make sure to cite this blog entry, or come up with a catchier title yourself.

PS: I am fully prepared to be told that someone else has already said all this, or even done the research relating title catchiness to citation counts or other metrics. I have anticipated this in my other article, "Leap Before You Look: The Surprising Value of Writing Blog Entries Without Doing Your Research First."

Similarities Between Rolf Dobelli's Book and Ours

2013-09-12T13:28:00.000-07:00

Rolf Dobelli, a Swiss writer, published a book called The Art of Thinking Clearly earlier this year with HarperCollins in the U.S. The book’s original German edition was a #1 bestseller, and the book has sold over one million copies worldwide.

In perusing Mr. Dobelli’s book, we noticed several familiar-sounding passages. On closer examination, we found five instances of unattributed material that is either reproduced verbatim or closely paraphrased from text and arguments in our book, The Invisible Gorilla (Crown, 2010). They are listed at the end of this note.

Nassim Taleb (author of The Black Swan and other books) has also publicly noted similarities between his work and material in Mr. Dobelli’s book. We have also become aware of a similarity between material in Being Wrong by Kathryn Schulz and material in Mr. Dobelli’s book.

We sent a letter to Mr. Dobelli and his publishers noting our concern about these five passages. Mr. Dobelli replied to us privately and posted the following text on his website (since removed):

“I received two letters claiming inadequate attributions or citations in the book 'The Art of Thinking Clearly.' Some of the claims are true, some false. For the ones that are true, I take full responsibility. I will work closely with the publishers of my book that the corrections are put in effect as quickly as possible.”

In the interest of transparency, we have decided to post the list of similar passages here. We understand that Mr. Dobelli will be identifying the changes he intends to make to his book on his website as well.

— Christopher Chabris & Daniel Simons

Passages with overlap between The Invisible Gorilla by Christopher Chabris and Daniel Simons, and The Art of Thinking Clearly by Rolf Dobelli (portions of greatest similarity between the books are highlighted in red):

1. Chabris/Simons: In April 2006, rising waters made a ford through the start of the Avon River temporarily impassable, so it was closed and markers were put on both sides. Every day during the two weeks following the closure, one or two cars drove right past the warning signs and into the river. These drivers apparently were so focused on their navigation displays that they didn’t see what was right in front of them. [pp. 41–42]

Dobelli: After heavy rains in the south of England, a river in a small village overflowed its banks. The police closed the ford, the shallow part of the river where vehicles cross, and diverted traffic. The crossing stayed closed for two weeks, but each day at least one car drove past the warning sign and into the rushing water. The drivers were so focused on their car’s navigation systems that they didn’t notice what was right in front of them. [p. 263; opening paragraph of Chapter 88]

2. Chabris/Simons: The “Nun Bun” was a cinnamon pastry whose twisty rolls eerily resembled the nose and jowls of Mother Teresa. It was found in a Nashville coffee shop in 1996, but was stolen on Christmas in 2005. [p. 155]

Dobelli: The “Nun Bun” was a cinnamon pastry whose markings resembled the nose and jowls of Mother Teresa. It was found in a Nashville coffee shop in 1996 but was stolen on Christmas in 2005. [p. 310]

3. Chabris/Simons: “Our Lady of the Underpass” was another appearance by the Virgin Mary, this time in the guise of a salt stain under Interstate 94 in Chicago that drew huge crowds and stopped traffic for months in 2005. Other cases include Hot Chocolate Jesus, Jesus on a shrimp tail dinner, Jesus in a dental x-ray, and Cheesus (a Cheeto purportedly shaped like Jesus). [p. 155]

Dobelli: “Our Lady of the Underpass” was another appearance by the Virgin Mary, this time as a salt stain under Interstate 94 in Chicago in 2005. Other cases include Hot Chocolate Jesus, Jesus on a shrimp tail dinner, Jesus in a dental X-ray, and a Cheeto shaped like Jesus. [p. 310]

4. Chabris/Simons: In other words, almost immediately after you see an object that looks anything like a face, your brain treats it like a face and processes it differently than other objects. [p. 156]

Dobelli: As soon as an object looks like a face, the brain treats it like a face—this is very different from other objects. [p. 310]

5. The paragraph in Dobelli is a condensation and paraphrase of a longer passage and argument appearing in The Invisible Gorilla:

Chabris/Simons: It may come as a surprise, then, to learn that talking to a passenger in your car is not nearly as disruptive as talking on a cell phone. In fact, most of the evidence suggests that talking to a passenger has little or no effect on driving ability.⁴⁰

Talking to a passenger could be less problematic for several reasons. First, it’s simply easier to hear and understand someone right next to you than someone on a phone, so you don’t need to exert as much effort just to keep up with the conversation. Second, the person sitting next to you provides another set of eyes—a passenger might notice something unexpected on the road and alert you, a service your cell-phone conversation partner can’t provide. The most interesting reason for this difference between cell-phone conversation partners and passengers has to do with the social demands of conversations. When you converse with the other people in your car, they are aware of the environment you are in. Consequently, if you enter a challenging driving situation and stop speaking, your passengers will quickly deduce the reason for your silence. There’s no social demand for you to keep speaking because the driving context adjusts the expectations of everyone in the car about social interaction. When talking on a cell phone, though, you feel a strong social demand to continue the conversation despite difficult driving conditions because your conversation partner has no reason to expect you to suddenly stop and start speaking. These three factors, in combination, help to explain why talking on a cell phone is particularly dangerous when driving, more so than many other forms of distraction. [p. 26]

Dobelli: And, if instead of phoning someone, you chat with whomever is in the passenger seat? Research found no negative effects. First, face-to-face conversations are much clearer than phone conversations, that is, your brain must not work so hard to decipher the messages. Second, your passenger understands that if the situation gets dangerous, the chatting will be interrupted. That means you do not feel compelled to continue the conversation. Third, your passenger has an additional pair of eyes and can point out dangers. [pp. 353–354]

Should Poker Be (A Tiny Bit) More Like Chess?

2013-08-22T12:17:00.000-07:00

There are similarities between tournament poker and tournament chess, and many serious chess players, including some grandmasters, have taken up poker with success. Poker is a much, much richer game, however, because there is more variance in outcomes when weaker and stronger players face each other. Thousands of poker players pay $1000, $1500, or even $10,000 to enter poker tournaments, routinely creating multimillion-dollar prize pools. Most chess tournaments, except for invitational events reserved for the very top players in a country or the world, also charge entry fees, but tournament chess doesn't have enough variance to get people to put up even $1000 (in today's dollars) to play. So it's a good thing that poker isn't more like chess in the way that stronger chess players are very likely to win against weaker ones.

But as I was reading the August 21 issue of Card Player magazine recently, I stumbled across a discussion that got me thinking that poker still needs some improving. In a column called "The Rules Guy" (which is not yet available online) I read the following:

The Rules Guy: Props to Antonio Esfandiari. TRG salutes Antonio Esfandiari for saying "You're both out of line" to Jungleman (Dan Cates) and Scott Seiver after their intense verbal altercation on a Party Poker Premier League VI broadcast. A calming voice can, well, work magic.

What happened, in a nutshell, was that Cates had broken the rules by acting out of turn several times at the table. Acting out of turn means betting, folding, or doing other things you normally do when it is your turn to bet, but doing them before it is your turn. This is bad because it gives the players who are supposed to act before you information about your hand strength and intentions that they aren't supposed to have. Therefore it can help those players, and also hurt other players. It can also be a way of colluding with others at the table, which is obviously a fundamental no-no. Seiver called out Cates on his repeated out-of-turn acting, Cates said something in response, Seiver said "it's like actual cheating," Cates used the f-word, and it went on from there.

At some point, according to the article, Esfandiari said, "You're both out of line. You're [Cates] out of line for acting out of turn; you're [Seiver] out of line for attacking him." He is portrayed as the level-headed hero of the whole episode and gets "props" from The Pseudonymous Rules Guy.

When I was at the World Series of Poker this past June, I played in a $1500 buy-in no limit hold'em tournament. I was doing pretty well at my first table, but then the table broke and I was moved to a table of mostly younger players, plus one very well-known pro: Phil Laak, who is a close friend of Antonio Esfandiari. (They appear together on the ESPN broadcasts and even co-hosted an entire series about prop betting a few years back.) Laak was three seats to my right, and he was acting just like he acts on all the televised poker events that love to show him. He was hamming it up, saying crazy and clever things, acting alternately bored and intensely interested, jumping up to take an occasional picture with a fan, making friends with everyone, and so on.

Laak was in the last hand before the dinner break. The clock had run down, so everyone was free to go if they wanted to, but I stayed at the table to see the hand play out. Laak was heads-up against the player to his left (two seats to my right). There was much betting, and after the river card was dealt Laak went all-in. His opponent started thinking about this major decision of whether to call or fold. He had enough chips to call without being knocked out, but a lot of chips were at stake.

By this point Esfandiari had come over to our table. I don't know if he was playing in the same tournament, another tournament, or what, but he came to talk to Laak about plans for the dinner break. The two of them were talking, while Laak was in the hand, even before Laak had made his final all-in bet. This seems bad to me. Why should any player who is in a hand be allowed to say anything to anyone else while the hand is going on? But it got worse.

As Laak's opponent, who as far as I could tell did not know Laak personally, or at least was not great friends with him, was thinking over his decision, Esfandiari leaned over him and said something like "Hurry up and fold, we want to go eat!" Those probably weren't his exact words; I didn't write them down. But he clearly spoke to the player whose turn it was to act, and clearly spoke to him about one of the actions he was contemplating. He didn't just say "hurry up"—he mentioned folding too.

Now I don't think Esfandiari knew what Laak's hand was. Perhaps he was just goofing around because he was hungry and wanted to go and eat. But I wouldn't be surprised if he was also trying to help Laak just a little bit, perhaps unconsciously, by throwing Laak's opponent off his train of thought, or by sowing doubt about the right play to make.

As it happened, the guy didn't seem bothered by Esfandiari and didn't complain about him. He eventually called, only to find that Laak had made a straight flush on the river. In retrospect, he should have taken the "advice" and folded.

Regardless, I found Esfandiari's actions appalling. I didn't say anything, not being an expert in poker rules and etiquette, and not being involved in the hand, but I thought that if this was legal, it was a very big difference from chess. In a chess tournament, you aren't allowed to do anything close to that. If a friend of Magnus Carlsen walked up to Carlsen's opponent and said "just resign already" while he was thinking about his next move ... I cannot imagine what would happen, since it's so far outside the realm of possibility. Garry Kasparov used to be criticized severely for making faces during games in reaction to his opponent's moves. This is orders of magnitude worse.

Esfandiari may have been right about Cates and Seiver (though I think repeatedly acting out of turn is worse than getting pissed off at someone for repeatedly acting out of turn), but I think he was wrong to say a single word to Laak, or especially Laak's opponent, while their hand was in progress. It doesn't matter that he's a famous pro, or that he's the all-time biggest money winner in tournament poker, or that he's considered to be a nice guy. Let's keep poker different from chess in all the ways that matter for its popularity, but let's make it more like chess by enacting or enforcing rules that help each player, amateur and pro alike, make their decisions by themselves, in peace.

Polishing Rabbits and Passing Off Squirrels—Andrew Zolli on Jonah Lehrer

2013-02-22T19:38:00.004-08:00

Andrew Zolli, the Executive Director and curator of PopTech, as well as the co-author of Resilience, sent me a very thoughtful reflection in response to my earlier post on Jonah Lehrer and his recent apology. He had tried to post it as a comment on my post, but ran up against Blogger's comment length limits. So with Andrew's consent, I am posting it below. I think that Andrew's points are excellent. (Note: I have been an invited speaker at PopTech, both on the stage and to the Fellows program.)

At this point, the whole sad L'Affaire d'Lehrer has been dissected into a finely-ground powder, and everyone has assigned Jonah appropriate culpability, including Jonah himself. What I find of more lasting interest is a systemic issue which Chris touches on glancingly, above:

We live in a media moment that massively encourages and rewards the pulling of proverbial rabbits out of hats—storytelling that culminates in a counterintuitive fact about human beings and their nature. It's sort of "Sudoku storytelling," in which the reader is presented with a confusing storyline, and the author presents a rubric and reassembles the elements in a way that snaps the pieces into place in a clean and satisfying way. This kind of writing gives the reader a little positive jolt, a sense that they've been let in on some secret wisdom that decodes part of the human condition. (That "snapping into place" phenomenon—it's what makes a joke with a good punchline work too—you know it's coming, and you can't quite see how it will resolve itself, and then *wham*—there it is! The same is true for get-rich-quick schemes.)

These are the kinds of pieces—not just books, but blog pieces, and other forms of writing—that go "viral." Our appetite for such secret wisdom is so strong that passing them along actually raises the social capital of the *forwarder*, not just the author. (This is what Twitter was made for, I believe.)

And this is *exactly* the kind of content that beleaguered mainstream editors often push writers, particularly talented writers, to produce—not nuanced tomes with confidence intervals attached to data, including examples of counterfactuals and copious footnotes—but snappy, highly "applicable," linear narratives (with counterintuitive endings!) that sacrifice complexity for accessibility. (As one editor put it to me: "You wanna write that other shit? Go to a university press!")

And it's not just editors—these are the kinds of books that command significant advances, that backlist, that build the author's speaking fees, that get them bylined articles in prominent magazines, and TV appearances—a whole edifice that, most of the time, ends up with the "talent" becoming a not-terribly-intellectual-public-intellectual. (By the way, it's not just science writers … business gurus in particular are often peddlers of pure horseshit, yet find an insatiable appetite for their nonsense. Because if there's one thing human beings find even more interesting than ourselves, it's how to make a buck off of some other clueless rube.)

Of course, the big problem is that there really isn't an endless supply of rabbits to pull out of hats. And not all rabbits are of first quality—sometimes, we have to "polish the rabbit," so to speak. And that's how I believe Jonah (whom I know personally, though not well) got into this predicament—being overly committed to the rabbit production line. So you start to reuse your rabbits, then you try to pass off second quality rabbits by making them look all the more surprising. And then you're panicked to discover you're passing off squirrels.

Oddly enough, the rabbit-out-of-the-hat counterintuitive ending is actually Jonah's story, which is why his downfall itself went viral. You think this guy is just blessed with preternatural explanatory talent, but it turns out, "the 'Imagine' guy was making up his own quotes!" It's a joke! And a punchline! Love it! Instant schadenfreude! Have you heard? Pass it on!

I am not excusing Jonah for his mistakes, which are significant. I think it's an honor to be held to a high standard, and he failed that standard, more than once. Worse, he had (and has) the abundant and enviable talent not to fail. And there should be real consequences for his having done so.

Yet I also think we ought to be careful in making him a cautionary tale for a civilization drowning in its own bullshit. He was unprofessional, but he was also responding to perverse incentives and societal norms in our public square that we collectively bolster, if not passively tolerate, by our own consumption habits.

For me, I'm trying to become more mindful of my own bullshitological contributions—which are, I'm sure, greater than I'd care to admit. I'm also finding myself reflecting on how we might make the system itself better, with fewer incentives for bad behavior, and better rewards for good behavior.

Because, while I'm sure there is some intrinsic character in all of us, it's also true that incentives draw forth aspects of that character, which then can come to publicly define us. (I can be fairly charitably-minded until someone cuts me off in traffic; fortunately for me, my utterances thereafter are not part of the public record.)

So here's my concluding truism: Piling on Jonah is like jumping on a trampoline: fun for a while, but it won't take us very far. Better to think about how we can springboard to a better place for everyone.

I know it's not counter-intuitive enough. I guess I'll never make it in this business.

How Much BAM for the Buck, and Other Thoughts on the Brain Activity Map Project

2013-02-18T19:57:00.003-08:00

Today's New York Times reports that the Obama administration is considering a massive, partly government-funded project to map the human brain, the Brain Activity Map (BAM!) Project, inspired by the success of the Human Genome Project.

Let me start by saying that I am all in favor of more research in neuroscience, because there is certainly a lot we don't know about how the brain works. While to outsiders like Ray Kurzweil it may look like progress is coming in leaps and bounds, and backing up the mind's hard drive is therefore a calculable number of years away, from the inside the effort to understand the brain often seems to zigzag from new idea to cool finding to neat technology without a clear forward trajectory. I am also a big fan of George Church, a genius and visionary of molecular biology who is one of the driving forces behind the new plan. (I even once co-taught a course on cognitive genetics at Harvard with George's wife, the geneticist Ting Wu.) But before we all jump on this bandwagon, let's discuss the pros and cons—based on what has been said publicly so far (mainly in the Times article, which was prefigured by a Neuron article by Church and several others published last June).

Per the Times, the project is expected to cost "billions of dollars" and last 10 years. Its goals are to "advance the knowledge of the brain's billions of neurons and gain greater insights into perception, actions, and, ultimately, consciousness." So far, so good—basic science. Some also hope that the project will "develop the technology essential to understanding diseases like Alzheimer's and Parkinson's, as well as to find new therapies for a variety of mental illnesses." That's certainly possible, though I cannot think of any treatments for mental illness or brain disease that have been derived from previous maps of the brain or knowledge of its activity patterns. Perhaps this is just an argument that we need better maps. Finally, "the project holds the potential of paving the way for advances in artificial intelligence." Certainly also possible, but I think AI has been doing pretty well lately by ignoring brain architecture and going with whatever algorithms work on computer hardware to produce intelligent-seeming behavior.

The Times account is short on details of what precisely is being proposed, which has led some people to think that the idea is to map every connection and the firing activity of every neuron in (at least) one human brain, or to make more maps of the functions of brain regions using neuroimaging techniques. But the Neuron article by the Brain Activity Map proponents makes it clear that, last June at least, the idea was to start with small circuits in very small organisms, where it may soon be possible to record from every participating neuron at once, and to work up to larger circuits and larger organisms. All these maps would record "the patterns and sequences of neuronal firing by all neurons" in the relevant circuit or brain, so they would be much more detailed, in both space and time, than any existing databases. A Drosophila brain might be done in ten years, a mouse neocortex in fifteen. The entire human brain would be a more distant goal. And of course there would be ethical issues to be surfaced and solved along the way to that ultimate step.

There are a lot of things to like about this ambition. Although we already have lots of maps of the brain, none of them (but one—the structural connectome of the C. elegans worm) approach the spatial resolution of a neuron-by-neuron map. The main source of our knowledge about how neurons represent information, carry out computations, and communicate with other neurons is still the single-cell recording, a technique developed about half a century ago. Such methods are based on inserting tiny electrodes in or near living neurons, and have obvious limitations, not least their inability to scale to full circuits or brain regions. Recording entire circuits in action would be a fantastic achievement and probably would lead to all sorts of ancillary benefits for advancing brain research, some foreseeable and some not. And perhaps more neuroscientists would be able to find jobs along the way!

But there are some considerations on the other side of the ledger, too. One that should not be underestimated is the opportunity cost; always, but especially nowadays, it would be a mistake to imagine that the funding for a new, large project will appear out of thin air. If the BAM goes forward, other areas are likely to get less funding, and other neuroscience and behavioral science projects will likely be among the first to be reduced. Moreover, a single mega-project is likely to supplant many smaller projects. Is our neuroscience money best spent on one project costing, say, $5 billion, or instead a thousand projects of $5 million each, or ten thousand projects with $500K budgets? Gary Marcus has a suggestion for five $1 billion projects. Which funding strategy is likely to result in more important discoveries, as viewed from the perspective of the next generation of scientists looking back? Maybe the BAM, but maybe not. The answer is hardly obvious to me. The big project is concrete and tangible, with milestones in the near future. The net effect of the tinkering of ten thousand labs with comparatively small budgets is harder to conceive of, but might turn out to be much larger.
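To make the comparison concrete, here is a small sketch confirming that the alternatives above all add up to the same illustrative total. The $5 billion figure is only the "say, $5 billion" used in the paragraph above, not an actual budget, and the dictionary labels are invented for this example.

```python
# Illustrative total from the paragraph above ("say, $5 billion").
TOTAL_BUDGET = 5_000_000_000

# Each option: (number of projects, dollars per project).
options = {
    "one mega-project": (1, 5_000_000_000),
    "a thousand projects": (1_000, 5_000_000),
    "ten thousand projects": (10_000, 500_000),
    "five $1 billion projects": (5, 1_000_000_000),
}

totals = {name: n * cost for name, (n, cost) in options.items()}

for name, (n, cost) in options.items():
    print(f"{name}: {n:,} x ${cost:,} = ${totals[name]:,}")
```

Since every option spends the same money, the question is purely one of how to divide it, which is why the answer depends on which allocation yields more discoveries and not on the size of the budget.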

One reason to be suspicious of the potential return-on-investment of a massive BAM project is that it's being sold by comparing it to the Human Genome Project (HGP), with a claim that the HGP produced $141 in economic activity for every $1 the government spent on it. President Obama cited this figure in his State of the Union Address. That's a return of fourteen thousand percent! Can that be right? If so, it would mean that about $800 billion in economic activity has been generated by that one government "investment." It turns out that this claim comes from a Battelle report (which is cited by the BAM advocates in their Neuron article) that was sponsored by a company that makes equipment used in life science research.
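The arithmetic behind those numbers can be sketched in a few lines. Both inputs are the figures quoted above. The implied government spending is merely what falls out of dividing one by the other, not a number taken from the report, and the variable names are my own.

```python
# Figures quoted in the paragraph above.
ROI_MULTIPLE = 141               # dollars of economic activity per dollar spent
TOTAL_ACTIVITY = 800_000_000_000 # "about $800 billion"

# Net return: $141 back on $1 is a $140 gain, i.e. 14,000 percent.
net_return_pct = (ROI_MULTIPLE - 1) * 100

# Spending implied by the two quoted figures (a derived value, not a sourced one).
implied_spending = TOTAL_ACTIVITY / ROI_MULTIPLE

print(f"Net return: {net_return_pct:,}%")
print(f"Implied spending: ${implied_spending / 1e9:.1f} billion")
```

The two quoted figures are at least consistent with each other: together they imply government spending somewhere between $5 billion and $6 billion.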

I find this figure hard to believe, not to say preposterous. Does it really represent net economic activity, or does it account for activity displaced from other spheres, and was all that economic activity the best activity that could have been done, or was it activity that pursuit of grant funding and other non-market incentives encouraged? What if the same amount of government money had been spent in funding lots of individual genetics researchers instead, or on other biology researchers, or other science entirely? The certainty with which these sorts of analyses are presented makes it hard to see counterfactual alternatives, but they lurk everywhere. At a minimum the $800B value must rest on a lot of assumptions, and the specific assumptions made probably have a large impact on the value that comes out of the analysis.

To be clear: I think the genome project was a great scientific idea, I suspect that it has produced a lot of benefits, and I am personally happy it was done. I just don't think it should be oversold. As Richard Feynman pointed out in his famous "Cargo Cult" speech, public support for research will eventually erode if it is sold with outrageous-sounding claims or promises of early benefits.

But suppose it is true that the Human Genome Project was the single best thing the U.S. government ever spent its money on—sorry, "investment it ever made"—the government's version of buying Apple stock for $5 and selling at $700. Should we expect similar returns from the next big science project? Or should we expect to see the economic return and gains in knowledge achieved by the average of the big science projects that the government has funded over the past decades? The abandoned supercollider, the war on cancer, the cancelled breeder reactor, and I am sure many others fade from memory—and certainly never get mentioned—when we are told about the 141X ROI of the genome project (worthy as it was). An analysis that looked at all the comparable projects rather than just the all-time outlier might come to a different projection of the likely value of the BAM. We might still expect a positive return, but without the 141X (or whatever the true value is), it will have a tougher time competing with other priorities, or with other ways of parceling out neuroscience funding.

Europe has thrown its lot behind the single mega-project approach, with an effort to simulate an entire brain at a cost of over 1 billion Euros. Regardless of the (questionable) merit of this idea, perhaps the U.S. should play a different strategy in the competition for research glory by letting a thousand flowers bloom rather than planting one ginormous tree. Indeed, such a contrarian approach may have value precisely because of the limits of the mapmaking approach to understanding the brain.

Forty years ago, single-cell neurophysiologist Horace Barlow famously proposed that "a description of that activity of a single nerve cell which is transmitted to and influences other nerve cells and of a nerve cell's response to such influences from other cells, is a complete enough description for functional understanding of the nervous system." The BAM Project seems to be a plan to create exactly this sort of description, but at a much larger scale. But as David Marr explained in his 1982 book Vision, and as Hilary Putnam also suggested in his 1973 Cognition article "Reductionism and the Nature of Psychology," there are several other levels of explanation that are equally important in reaching a "functional understanding" of how the brain works. The representations, algorithms, and computational functions of the brain and its circuits, as well as the relationship of the brain to the organism and its environment and niche, are just as important as a map that shows how the neurons are wired up and how they send signals to one another.

Again, it is not that a BAM would have no value. I would personally be fascinated to see its results, and those results might well help us to crack the problem of how higher-level properties emerge out of agglomerations of lower-level events (which the psychologist Stephen Kosslyn, a founder of cognitive neuroscience, proposed as one of the hardest problems in social science). But the sheer size of a full BAM project might focus our attention and hopes on the BAM as the be-all and end-all of neuroscience, and distract the field from devoting energy to those other levels. Cognitive scientist Mark Changizi has eloquently argued, in fact, that the massive project we ought to be pursuing is a map of the "teleome," his coinage for the suite of functions and abilities that the nervous system was designed by evolution to perform. Without knowing more about function, it will be hard to understand the BAM's results, and perhaps even harder to build the EU's whole-brain computer simulation. As the proposal moves forward, I hope the decision-makers keep in mind that maps, while incredibly useful tools, don't give answers to every important question.

What Has Been Forgotten About Jonah Lehrer

2013-02-12T12:45:00.000-08:00

Today the science writer Jonah Lehrer made his first extended public remarks since he resigned his various positions and his publisher withdrew his third book last summer. The venue was a Knight Foundation conference in Miami. Lehrer gave a short speech about decision making, focusing on his own bad decisions and how he plans to prevent them from recurring in the future. To my surprise, the foundation, which supports "journalistic excellence," seems to have paid Lehrer $20,000 for his appearance.

As is well known, Lehrer first got into trouble last year when it was revealed that his new blog at the New Yorker incorporated much material that he had previously published, including in his old column at the Wall Street Journal. This led to a suspension of his blogging privileges. Then various investigations showed that he had not only "self-plagiarized" (a lazy and exploitative practice) but also plagiarized the work of others, and perhaps worst of all embellished and fabricated quotes from his interview subjects (most prominently Bob Dylan) and other sources. The New Yorker finally let him go, as did Wired. He completely ceased tweeting, Facebooking, or updating his website.

At first I felt bad about Jonah Lehrer's problems. He seemed like a nice person. When I published a fairly negative review of his third book, Imagine: How Creativity Works, in the New York Times, he was up on his blog with a reply, titled "On Bad Reviews," in a matter of hours. I wrote my own strong rebuttal and posted it a couple of days later. The next day, Lehrer emailed me proposing that he interview me by email about the issues I had raised, for publication on his blog. We did the interview, which took several weeks to complete. After various delays, caused by the suspension and then cancellation of his blog, the interview was finally published at the Creativity Post website. I was pleasantly surprised that Lehrer bothered to engage my criticism, and then to ask me directly how I thought he (and other science writers) could improve their practices. I was a bit upset when he tried to block the final publication of the interview, which was supposed to happen (coincidentally) the day after he departed the New Yorker, but the Creativity Post editors managed to convince him to change his mind.

When the allegations of plagiarism and fabrication came out, the story became one of "greatest science writer of his generation makes unthinkable mistakes," and the analysis was mostly psychoanalysis of Lehrer's motives or of the media culture. Entirely lost was the fact that Jonah Lehrer was never a very good science writer. He seemed not to fully understand the science he was trying to explain; his explanations were inaccurate, overblown, and often just plain wrong, usually in the direction of giving his readers counterintuitive thrills and challenging their settled beliefs. You can read my review and the various parts of my exchange with him that are linked above for detailed explanations of why I make this claim. Others have made similar points too, for example Isaac Chotiner at the New Republic and Tim Requarth and Meehan Crist at The Millions. But the tenor of many critics last year was "he committed unforgivable journalistic sins and should be punished for them, but he still got the science right." There was a clear sense that one had nothing to do with the other.

In my opinion, the fabrications and the scientific misunderstanding are actually closely related. The fabrications tended to follow a pattern of perfecting the stories and anecdotes that Lehrer -- like almost all successful science writers nowadays -- used to illustrate his arguments. Had he used only words Bob Dylan actually said, and only the true facts about Dylan's 1960s songwriting travails, the story wouldn't have been as smooth. It's human nature to be more convinced by concrete stories than by abstract statistics and ideas, so the convincingness of Lehrer's science writing came from the brilliance of his stories, characters, and quotes. Those are the elements that people process fluently and remember long after the details of experiments and analyses fade.

After the Dylan episode, others found more examples of how Lehrer did this. I think one of the clearest was Seth Mnookin's analysis of Lehrer's retelling of psychologist Leon Festinger's famous original story of "cognitive dissonance," based on Festinger's experience of infiltrating a doomsday cult in 1954. Of the moments after an expected civilization-destroying cataclysm failed to start, Festinger wrote, "Midnight had passed and nothing had happened ... But there was little to see in the reactions of the people in that room. There was no talking, no sound. People sat stock still, their faces seemingly frozen and expressionless." Lehrer narrated the same event as follows: "When the clock read 12:01 and there were still no aliens, the cultists began to worry. A few began to cry. The aliens had let them down." Do you see the difference? Lehrer's version is more dramatic: people worry, they cry, they feel let down. It's more human. Each one of these little errors or fabrications makes the story work a little bit better, makes it match our expectations more closely, and thus gives it greater influence on our beliefs.

So by cutting exactly these corners in his writing, Lehrer was able to mask the fact that his conclusions were facile or erroneous, and his prose earned him a reputation for being much more authoritative than he was. Who was harmed by all of this? Writers who were trying to do with correct understanding and real quotes and stories what Lehrer did with his "material," for one. And certainly his editors, publishers, and anyone else who paid money for his halo and his drawing power. But readers most of all, since they were told things about how nature works that simply weren't true. Not just what Bob Dylan said and when he said it, but what it has to do with creativity, neuroscience, and everything else.

Jonah Lehrer gave a talk today that was more interesting than I expected. He acknowledged his mistakes and said he was trying to erect operating procedures and safeguards to make sure his own arrogance stays in check in the future. He said some things that were hard to believe, such as his claim that he has a poster in his office of Bob Dylan by Milton Glaser (a graphic artist also misquoted by Lehrer), and that he flinches every time he sees it. Does he really flinch every time? Hasn't habituation or inattention taken care of that by now?

I actually think Lehrer might be able to return to writing successfully, because he has the technical skills, and he is obviously a very intelligent and energetic person. But he should take the time not only to protect himself against his tendency to fabricate and plagiarize, but also to learn the basics of journalistic practice and ethics, to learn how to think clearly about science and facts, and above all to commit himself to the truth. Then maybe he will have something valuable to tell us.

Six Big Problems With "Why Can Some Kids Handle Pressure ..."

2013-02-11T19:43:00.001-08:00

Surely how kids handle pressure is an important and interesting question. And surely how we perform in pressure situations has a lot to do with our genes. But the recent New York Times article "Why Can Some Kids Handle Pressure While Others Fall Apart?" by Po Bronson and Ashley Merryman is shot through with the most basic mistakes in science writing about behavior genetics. This makes me sad, because I have liked the authors' previous books, and because I think it is quite possible to communicate research on genetics accurately for an intelligent general audience. Here, unfortunately, they appear to have taken no note of what has happened in behavior genetics in the past 5–10 years, which ought to have been a prerequisite for this piece. A few examples:

Exaggerated claims: "One particular gene, referred to as the COMT gene, could to a large degree explain why one child is more prone to be a worrier, while another may be unflappable" [emphasis added]. In reality, what kind of COMT gene you have, if it is relevant, is an extremely minor influence by itself on how much you worry. The particular variant of the COMT gene being discussed here is very common, and like all other common genetic variants, it has never been shown to have a large, or even medium-sized, influence on any behavioral traits.

Cherrypicking the study with the most dramatic results: "Other research has found that those with the slow-acting enzymes have higher IQs, on average. One study of Beijing schoolchildren calculated the advantage to be 10 IQ points." In 2013 it should be regarded as journalistic malpractice to write things like this when the average of all the studies on this gene and IQ shows the effect to be, at best, a tiny fraction of 10 IQ points. In an analysis that included almost 10,000 subjects from two countries, in fact, a team of colleagues and I found virtually no evidence of any effect of COMT on IQ.

Idealizing your favorite study: "In other words, the exam was a perfect, real world experiment for studying the effects of genetics on high-stakes competition." In reality, there are no "perfect" experiments, and the one Bronson and Merryman report on had only 779 subjects, which might seem like a lot, but is almost certainly too small to learn anything reliable about genetic effects. About 100 times more participants are needed to really answer these questions.

Labeling genes with behaviors and pretending that possessing a genetic variant makes you a particular type of lucky or unlucky person: The two variants of the COMT gene are labelled "warrior" and "worrier" (for the different responses to stress they supposedly cause people to have -- get it??), and then people are in turn labelled as Warriors or Worriers based on their genotypes. That's tantamount to calling the variants of APOE the "Doofus" and "Genius" genes because one makes you more likely to develop Alzheimer's disease while the other offers some protection against dementia. No, wait, it's not, because APOE has a highly significant effect on Alzheimer risk that has been replicated over and over by independent researchers, but COMT's links to the behaviors discussed in this article are smaller and more tenuous. Later we are told that the Worriers' "genetically blessed working memory and attention advantage kicked in. And their experience meant they didn't melt under the pressure of their genetic curse." I thought we gave up on this superficial genes-as-personality-types-and-blessings-or-curses kind of science writing years ago.

Contradicting your own point: "... we are all Warriors or Worriers ... In truth, because we all get one COMT gene from our father and one from our mother, about half of all people inherit one of each gene variation, so they have a mix of the enzymes and are somewhere in between the Warriors and the Worriers." (Is anyone else reminded of the camp 1970s film "The Warriors," about gangs that roam the New York City subways?) We can't all be one type or the other if half of us are both. And incidentally, the pattern of 25%-50%-25% of the three genotypes does not arise only because we get one allele from each parent. It also depends on the frequency of each of the two variants being about 50% in the population, which happens to be the case for this COMT polymorphism.

Pretending that what has been known for generations is a new discovery: "Stress turns out to be far more complicated than we've assumed ... short-term stress can actually help people perform ..." And later: "It may be difficult to believe ... that stress can benefit your performance." But psychology textbooks have long taught that the level of arousal for optimal performance is moderate, with too much arousal or too little leading to lower performance. This is called the Yerkes-Dodson Law, and it was originally proposed in 1908. Perhaps worth a mention?
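The earlier point about the 25%-50%-25% genotype pattern can be checked with a few lines of arithmetic. This is a minimal sketch assuming random mating (the standard Hardy-Weinberg proportions), which is my assumption rather than anything stated in the article; the function name and the 20% comparison value are hypothetical and chosen only for illustration.

```python
def genotype_frequencies(p):
    """Expected genotype shares for a two-variant gene under random mating.

    p is the population frequency of the first variant. Returns the shares of
    people carrying two copies of the first variant, one copy of each, and
    two copies of the second variant.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


if __name__ == "__main__":
    # 0.5 matches the roughly equal frequencies described in the text;
    # 0.2 is a made-up contrast case.
    for p in (0.5, 0.2):
        both_first, mixed, both_second = genotype_frequencies(p)
        print(f"p = {p:.1f}: {both_first:.0%} / {mixed:.0%} / {both_second:.0%}")
```

With equal variant frequencies the shares come out to 25%, 50%, and 25%. With a 20% frequency for one variant, everyone still gets one allele from each parent, but only about a third of people carry one copy of each.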

The article makes much of findings that "those with Worrier-genes can still handle incredible stress." This would only be surprising if COMT had such a strong effect that it could determine what kind of person you are. But COMT doesn't have that effect. It's surprising when someone with the genotype for brown eyes has blue eyes instead, because the relevant genes almost completely determine the phenotype. It's not surprising that people with one of hundreds or thousands of genes that make one susceptible to stress turn out to be able to handle themselves just fine.
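To give a rough sense of why small genetic effects demand very large samples, as in the earlier point about the 779-subject study, here is a sketch of a textbook sample-size approximation for detecting a correlation (the Fisher z method, two-sided test). The effect sizes of 0.1 and 0.01 are hypothetical values I picked for illustration; they are not estimates from the article or from any COMT study, and this is not the calculation behind the "100 times" figure above.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_sample_size(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a true correlation r.

    Uses the Fisher z approximation for a two-sided test at significance
    level alpha with the requested power.
    """
    if not 0.0 < abs(r) < 1.0:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(abs(r))) ** 2 + 3)


if __name__ == "__main__":
    # Hypothetical effect sizes, for illustration only.
    for r in (0.1, 0.01):
        print(f"r = {r}: about {required_sample_size(r):,} subjects")
```

Because the required sample grows roughly with the inverse square of the effect size, an effect one-tenth as large needs about one hundred times as many subjects.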

If the authors were conversant with—and showed concern for—the relevant literature and the background science, they would not have made these mistakes. I understand that they are writers, not researchers, but people who write about research for the public have a simple obligation to communicate not just good stories, but reliable facts.