Thursday, November 7, 2019

For Better or Worse, the A/B Effect Is Real



1. Introduction

Earlier this year we published a set of studies in PNAS (open access article; Supplemental Information; OSF preregistrations, data, and code) on what we called the “A/B Effect,” which we defined as occurring when people “judge a randomized experiment comparing two unobjectionable policies or treatments (A and B), neither of which was known to be superior, as less appropriate than simply implementing either A or B for everyone” (Meyer et al., 2019, p. 10723). More precisely, we found that when separate groups of participants rated the appropriateness of policy A (for example, giving teachers a yearly bonus), policy B (e.g., giving teachers more vacation days), or an A/B test with the explicit goal of learning which policy leads to the best outcome (and then making that the policy for everyone going forward), the A/B test received lower mean appropriateness ratings, and more participants explicitly rated it as inappropriate, than either A or B. We observed this effect to varying degrees when people evaluated vignettes in several domains, including health care, humanitarian aid, education, tech, and “corporate” experimentation. We concluded that policymakers might face more backlash from running A/B tests to determine the effects of their proposed policies than from simply using their intuition to pick and impose a policy on everyone—a possibility that we, as fans of evidence-based decision-making, find lamentable.

Today, Berkeley Dietvorst, Robert Mislavsky, and Uri Simonsohn have posted on Simonsohn’s Data Colada blog a critique of our paper. Their post is an expanded version of a Letter by Mislavsky, Dietvorst, and Simonsohn (henceforth MDS) published yesterday in PNAS (our Reply will appear in PNAS soon) (UPDATE: our Reply was published in PNAS on 12 November 2019). MDS claim that “experiment aversion” (a term they use to refer to the A/B Effect we described, but to which they attach a more narrow meaning—a point we will elaborate on below) can be explained away as the result of a statistical artifact, and is therefore not actually occurring in our studies. We lay out this argument in section 2 below, but it boils down to a claim—for which they report evidence in their own paper that is in press at Marketing Science—that people’s rating of an experiment is never lower than their rating of the worst policy that that experiment contains. Because our PNAS studies were all between-subjects, where participants rated A or B or the A/B test, but not all three, we can only compare one group’s ratings of the A/B test to the ratings of A and B made by the other two groups of participants.

In designing our studies, we made a deliberate decision to conduct between-subjects experiments. In the real world, people are rarely given the opportunity to evaluate option A, option B, and a test comparing these. When a policy is implemented, it is rare that rejected alternative policies—or policies the decision-maker never thought of—are announced alongside it. Similarly, when an A/B test comparing two policies is announced, it is virtually never pointed out that the decision-maker could have chosen to impose A or B on everyone. The world, in other words, is a between-subjects experiment, where people rarely realize or contemplate the fact that they are, in effect, subject to policies and decisions that make them participants in any number of uncontrolled experiments. To choose anything other than a between-subjects design, therefore, would have sacrificed the external validity of our studies.

We did think, however, that alerting raters of A/B tests to the fact that agents often have the power to simply impose either untested policy on everyone might improve perceptions of experiments and/or that alerting raters of policies to foregone alternatives might make people more skeptical of policies. Either (or both) would have the effect of reducing the A/B Effect—and mitigating people’s objection to experimentation is one of the ultimate goals of our research program. To that end, we had long planned to run within-subjects experiments, some of which we have now conducted and reported in a preprint and describe below. In all three of the scenarios from Meyer et al. (2019) that we have so far tested in a fully within-subjects design, including one from the Study 3 scenarios that MDS prefer, we find robust evidence not only of the A/B Effect, as we defined it (AB < mean(A,B)), but also of increasingly conservative forms of “experiment aversion,” including MDS’s definition (AB < min(A,B)) and an even more conservative definition that requires individuals to rate the A/B test as inappropriate while also not objecting to either policy.

In this essay, we analyze MDS’s arguments and explain why there is strong evidence for both the A/B Effect and for MDS’s more narrow “experiment aversion,” both in our new within-subjects experiments and in the studies we reported in PNAS (evidence MDS ignore in their blog post and PNAS Letter). We end by addressing MDS’s objections to our Checklist and Best Drug scenarios and making some observations on their research and reporting practices.


2. The mathematical fact behind MDS’s critique

The core of MDS’s argument is a simple mathematical fact: given pairs of samples from two distributions with different means, the mean of the minimums must be less than or equal to the minimum of the means (in math, mean(min(A_i,B_i)) <= min(mean(A_i),mean(B_i)), where A_i and B_i are the ith respondent’s ratings of A and B respectively). In the context of our studies, if each participant is asked to rate their approval for two different policies, taking the lower rating of each person’s pair of ratings, and then taking the average of these lower ratings, is guaranteed to be less than or equal to the result of first averaging over all participants for each policy and then taking the minimum of these means.
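To see the inequality in action, here is a minimal sketch in Python (using ratings we simulated purely for illustration; nothing below comes from our data or from MDS’s):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 1-5 appropriateness ratings for 1,000 hypothetical respondents,
# each of whom rates both Policy A and Policy B (a within-subjects setup).
a = rng.integers(1, 6, size=1000)  # ratings of Policy A
b = rng.integers(1, 6, size=1000)  # ratings of Policy B

mean_of_mins = np.minimum(a, b).mean()   # mean(min(A_i, B_i))
min_of_means = min(a.mean(), b.mean())   # min(mean(A_i), mean(B_i))

# Each respondent's minimum is <= that respondent's rating of either policy,
# so averaging the minimums can never exceed the smaller of the two means.
assert mean_of_mins <= min_of_means
print(mean_of_mins, min_of_means)
```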

We have no argument with this mathematical observation. In theory, it could account for what MDS call “experiment aversion.” That is, when people rate experiments comparing two policies, A and B, as less appropriate than implementing either Policy A or Policy B alone for everyone, it could just be that in the experimental condition they see both policies and their rating of the experiment is simply the lower of their ratings of the two. In other words, people’s rating of the A/B test could be equal to their rating of the policy (A or B) that they, personally, like the least. In essence, this is what MDS claim to have found in their own in-press paper.

Finally, we acknowledge that in most of our scenarios we did not directly test for this possibility. As noted earlier, we adopted a between-subjects design; thus, in most of our scenarios, only the participants in the A/B condition saw both policies. We therefore compared the mean of the A/B condition with the minimum (and mean) of the A and B means, but were not able to compare the A/B condition to the mean of the minimums.

In other words, we agree that MDS have proposed a hypothesis that could explain why participants often rate A/B tests less favorably than universally implementing either policy A or B.


3. MDS draw inappropriate conclusions from this fact

If all they had done were to propose an additional possible explanation for the A/B Effect findings in our Study 3, we would not object. However, their critiques go further, claiming via simulations that the mathematical fact they rely on completely accounts for all our findings, rendering our claim about the existence of an A/B Effect invalid.

Their empirical exercise combines a small amount of new data (99 MTurkers judging the appropriateness of the A and B policies from all seven experiments in Study 3 of our PNAS paper) with some of our data (our participants’ evaluations of the seven A/B test and policy conditions from the same study) to simulate what would have happened if our participants had evaluated not just the A/B test, but also the A and B policies. In effect, they are simulating the outcome of within-subjects experiments and then claiming to have found no evidence that such experiments would find the A/B test to be judged worse than the worst policy. Their simulation assumes, however, that they can infer how our participants who saw only the A/B test would have rated the A and B policies, had they also seen them. This assumption is unwarranted, as our within-subjects experiments suggest (see Section 4, below); A/B tests are sometimes rated differently when the raters are explicitly told that A and B could have been implemented unilaterally, than when they are only told about the A/B test.
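To make the logic of that exercise concrete, here is a rough sketch of this kind of simulation in Python. It is not MDS’s actual code; the data, the pairing rule, and the variable names are our own simplifying assumptions, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical between-subjects data: three separate groups rated Policy A,
# Policy B, or the A/B test on a 1-5 appropriateness scale (invented numbers).
ratings_A = rng.integers(2, 6, size=300)    # raters who saw only Policy A
ratings_B = rng.integers(2, 6, size=300)    # raters who saw only Policy B
ratings_AB = rng.integers(1, 6, size=300)   # raters who saw only the A/B test

# The key move: for each A/B-test rater, impute a pair of policy ratings by
# sampling from the other two groups, then take the per-person minimum.
imputed_A = rng.choice(ratings_A, size=ratings_AB.size)
imputed_B = rng.choice(ratings_B, size=ratings_AB.size)
imputed_min = np.minimum(imputed_A, imputed_B)

# The simulated "within-subjects" comparison: is the A/B test rated below the
# imputed per-person minimum?
print(ratings_AB.mean(), imputed_min.mean())

# Note the buried assumption: the imputation treats policy evaluations as if
# they were unaffected by whether the rater saw only the A/B test -- exactly
# the assumption our within-subjects experiments call into question.
```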

There are other problems with MDS’s second, more expansive claim. First, even if it were true that lower ratings for experiments are always fully explained by individuals’ least-preferred policy, it simply does not follow that people never object to experiments comparing unobjectionable policies. (This claim is made in the subtitle of MDS’s in-press Marketing Science paper, which is “People don’t dislike a corporate experiment more than they dislike its worst condition.”) This first problem amounts to a difference in definitions. MDS present their finding—that they fixed a “problem with the statistical analysis” in our paper—as reflecting a mathematical fact. However, their conclusion depends on a particular definition of experiment aversion, in which an individual participant presented with all three possibilities—A, B, and A/B (the A/B test)—ranks A/B the lowest. Although this definition makes sense in light of MDS’s objectives, as explained above, it does not make much sense in the real world.

Put another way, if any part of participants’ objections to an experiment derives from the fact that they, unlike their peers who only see A or B, have been made aware that alternatives exist, that is a mechanism of the psychological phenomenon we reported. Rather than being a “confound,” we view the nature and extent of this mechanism as an important empirical question. Presumably MDS feel differently, but that is a difference of opinion regarding which phenomena are interesting to study, and not at all a “problem with the statistical analysis” on our part that they “corrected.”

Second, it is not true—both in the studies that we have conducted since (see Section 4) and in the studies we have already published (see Section 5)—that people object to experiments no more than they object to their least-preferred of the policies the experiment contains. To the contrary, people often rate experiments as significantly worse than the policy they personally regard as worst (and explain their ratings by objecting explicitly to the experiment rather than to the treatments it compares).


4. Within-subjects evidence for the A/B Effect and “experiment aversion”

MDS propose that a true test of experiment aversion is one in which a participant’s rating of an A/B test can be compared directly to his or her least-preferred option (A or B). We have now run several of our own scenarios in a fully within-subjects design that allows exactly this test. The results of our within-subjects studies clearly support the existence of not only our A/B Effect but also MDS’s “experiment aversion.” Below, we explain some context and the timing of these experiments, followed by a short summary of the results. Interested readers can refer to our preprint.

We found evidence for an A/B Effect and for MDS’s definition of experiment aversion in three out of three high-powered, preregistered experiments where we adapted scenarios from Meyer et al. (2019) to a within-subjects design. We began these experiments before learning about MDS’s critique. [1] The scenarios we ran included Hospital Safety Checklist (where a hospital director chooses between placing safety reminder checklists on posters in procedure rooms or on doctors’ ID badges), Best Drug: Walk-In Clinic (where some doctors prescribe Drug A and others prescribe Drug B to all of their patients, effectively randomizing which drug patients receive when they walk into the clinic), and Consumer Genetic Testing (where a genetic testing company chooses to return either medically actionable genetic test results or all genetic health results, including those which customers can do nothing about). We chose these scenarios because they varied in domain (health, clinical medicine, or corporate), the decision-making agent (a hospital director, a doctor, or a CEO), and the size of the A/B Effect observed in the between-subjects versions.

The only difference between these experiments and those reported in our original paper was the design: here, participants rated all three options (Policy A, Policy B, and an A/B test, presented in counterbalanced order) on the same page. This design allows participants to give the A/B test the same rating as their least-preferred policy (A or B)—the pattern that should emerge if MDS’s hypothesis is correct. This is not what happened. We refer readers to our preprint for more details, but the results are described in brief below.

First, we can simply look at how many participants objected to each condition. Figure 1 shows the percentages of participants who objected to A, B, or the A/B test (by rating them as somewhat or very inappropriate). Whereas only ~10% of participants objected to each policy option across the three scenarios, more than one-third objected to the experiment—even when these participants had also read about the two unilateral policy options. This result also refutes the additivity claim that MDS make in their critique via a dessert-allergy analogy: in each scenario, the striped bar is substantially taller than the sum of the black bar and white bar. Participants in these studies felt it was more appropriate to assign A or B to everyone than to run an A/B test with the explicit purpose of learning which policy is most effective, even when they had complete information about all three options.



Figure 1 (from Heck et al., 2019)

Second, this design allowed us to test MDS’s proposed hypothesis directly, so we did. We followed the exact summary of this test in their blog post: “Policies A and B were evaluated by the same people, and we compared the least acceptable policy for each participant to the acceptability of the experiment.” In other words, we took only the lower-rated policy between A and B for each participant, averaged over this measure, and compared it with the average rating of the experiment (computed over all participants). Figure 2 shows participants’ average ratings of Policy A (A symbols), Policy B (B symbols), the A/B test (X symbols), each participant’s average policy rating (black circles), and each participant’s least-preferred policy rating (black diamonds). Participants viewed the A/B test conditions as less appropriate than they viewed A, B, mean(A,B), and their least-preferred of the two (min(A,B)). By all measures and accounts, these results refute the premise that people don’t object to experiments.



Figure 2 (from Heck et al., 2019)
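For readers who want to see the comparison spelled out, here is a minimal sketch of the analysis behind Figure 2, written in Python over an illustrative data frame (the column names and the invented ratings are our own assumptions, not our actual data or code):

```python
import numpy as np
import pandas as pd

# Illustrative within-subjects data: each row is one participant's 1-5
# appropriateness ratings of Policy A, Policy B, and the A/B test.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "rating_A": rng.integers(3, 6, size=500),
    "rating_B": rng.integers(3, 6, size=500),
    "rating_AB": rng.integers(1, 5, size=500),
})

# Each participant's least-preferred policy (MDS's benchmark) ...
df["min_policy"] = df[["rating_A", "rating_B"]].min(axis=1)

# ... compared with the same participants' ratings of the experiment.
print("mean rating of the A/B test:   ", df["rating_AB"].mean())
print("mean of least-preferred policy:", df["min_policy"].mean())
# Experiment aversion in MDS's sense requires the first mean to fall below
# the second among the same respondents -- the pattern shown in Figure 2.
```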

Finally, a fully within-subjects design allows us to conduct an analysis that is even more restrictive than what MDS propose. We argue that it would be face-valid evidence for the A/B Effect if a participant explicitly objects to an A/B test (by rating it somewhat inappropriate or very inappropriate) while also choosing not to object to either Policy A or Policy B (by rating both as very appropriate, somewhat appropriate, or neither appropriate nor inappropriate). In each experiment, approximately 27% of participants did just this. In short, nearly one-third of participants demonstrated the strongest possible version of the A/B Effect by objecting to an A/B test while simultaneously choosing not to object to unilateral policy implementation of the untested policies it is designed to evaluate. [2]
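As a sketch of how this stricter criterion can be computed (assuming a 1–5 scale coded so that 1 = very inappropriate and 5 = very appropriate; the function name and toy numbers are ours, for illustration only):

```python
import numpy as np

def strict_ab_effect_share(rating_a, rating_b, rating_ab):
    """Share of respondents who rate the A/B test as somewhat or very
    inappropriate (<= 2 on the assumed 1-5 coding) while rating both
    policies at 3 ("neither appropriate nor inappropriate") or above."""
    rating_a, rating_b, rating_ab = map(np.asarray, (rating_a, rating_b, rating_ab))
    objects_to_ab = rating_ab <= 2
    accepts_both = (rating_a >= 3) & (rating_b >= 3)
    return (objects_to_ab & accepts_both).mean()

# Toy example with three respondents; only the first meets the strict criterion.
print(strict_ab_effect_share([4, 5, 2], [3, 4, 5], [2, 4, 1]))  # 0.333...
```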



5. Evidence from our PNAS studies that MDS ignore


Aside from our definitive within-subjects studies, there was already significant evidence in our PNAS studies to cast doubt on MDS’s claim that the A/B Effect wholly reduces to aversion to individuals’ least-favored policies.


First, in some of our scenarios, the percentage of participants who object to the A/B test exceeds the percentage who object to policy A plus the percentage who object to policy B. Thus, even assuming zero overlap between those who dislike policy A and those who dislike policy B (a fairly implausible assumption), dislike of the individual policies cannot fully account for the experiment aversion we observe in those scenarios.
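The arithmetic behind that bound is simple: under the least-preferred-policy account, a participant objects to the A/B test only if they object to A or to B, so the share objecting to the test can be at most the sum of the two policy-objection shares (the maximum is reached only with zero overlap). A toy check, with percentages invented purely for illustration:

```python
# Union bound implied by the "least-preferred policy" account:
#   P(object to A/B test) <= P(object to A) + P(object to B)
p_object_A = 0.10        # hypothetical share objecting to Policy A
p_object_B = 0.10        # hypothetical share objecting to Policy B
upper_bound = p_object_A + p_object_B   # attained only with zero overlap

observed_AB = 0.35       # hypothetical observed share objecting to the A/B test
print(observed_AB > upper_bound)        # True -> the account cannot explain it
```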


Second, as we note in our PNAS article, anticipating this concern about joint evaluation is what led us to run Studies 4 and 5 in Meyer et al. (2019), which provide good evidence that participants were objecting to experimentation and not merely to a disliked policy. In these studies, the A/B tests compared generic treatments—“Drug A” and “Drug B.” Unlike in MDS’s hypothetical A/B test, in which 30% of people (those with lactose intolerance) hate the dairy treatment, 30% of people (those with peanut allergies) hate the peanut treatment, and therefore 60% of people hate the A/B test, there is no rational basis here for one group of raters to hate “Drug A” and another group to hate “Drug B,” yielding an A/B Effect that is driven entirely by individual raters’ least-preferred policies. And so MDS’s lactose-peanut hypothesis cannot explain the fact that, in our original and replication MTurk experiments, about 10% of people in each policy condition objected to the agent’s decision to prescribe Drug A (or B) for everyone, but about 35% of people in the A/B conditions objected.


Finally, one obvious way to determine why people disapprove of experiments is to ask them. For this reason, in all of our experiments we asked our participants to tell us why they gave the ratings they gave. Of the 16 experiments reported in PNAS, we coded free responses from 4 (covering the Checklist and Best Drug: Walk-In scenarios), using two independent raters and a codebook of 28 codes. As we noted in the article (p. 10725), a fairly small minority of A/B test participants remarked that one policy was superior to the other or objected to unequal treatment of people, per se, and excluding those participants still yielded a large A/B Effect:


14% of participants in the A/B conditions of study 1 and its first replication [Checklist] commented that one policy was preferable or that the A/B test treated people unequally. Notably, excluding these participants still yields a substantial A/B effect in study 1, t(388) = 11.3, p < 0.001, d = 1.14, and its replication (study 2a), t(354) = 7.30, p < 0.001, d = 0.78.

Although in their Letter to PNAS MDS claim to have invalidated our results based on the so-called “minimum mean paradox” described above, they then go on to raise other, unrelated objections to the studies in our paper for which (as we show in our forthcoming Reply in PNAS) their analysis fails to deliver their desired result. We turn to those objections now.


6. Responses to MDS's criticisms of our health domain designs

MDS dismiss the Checklist scenario as “confounded” because the A/B condition differs from the policy conditions not only in involving a randomized experiment, but also in a second way: Only in the A/B condition, they say, are raters confronted with a decision to forgo a potentially life-saving intervention for no disclosed reason, whereas in the policy conditions, “there is only one option available.” But this is not true. When, in the A condition, the hospital director decides to implement a Badge policy for everyone, he necessarily forgoes (without explanation) a Poster policy (and any number of alternative or additive policies), any one of which might have greater life-saving power than Badge, either as a substitute for Badge or as an addition to it. The fact that policy implementers rarely disclose alternative or additional policies they have foregone does not mean those policies are not “available” or that the policy implementer’s decision “caused” deaths any less than did the director who ran an A/B test.

We agree entirely that our A/B test descriptions necessarily make extremely salient the fact that some people will be denied Badges and others will be denied Posters, whereas people rarely take away from an announcement of a universal Badge policy the consequence that everyone is being deprived of Posters. But that is the nature of how A/B tests and universal policies are announced, respectively, in the real world, and that real-world setting is precisely what we attempted to capture in our vignettes (see our previous comments about external validity). To the extent that A/B tests are perceived to deprive people of important interventions while decisions to choose between those same interventions in the form of a universal policy implementation are not so perceived, that is a mechanism of the A/B Effect, not a “confound” to be avoided. [3]

MDS suggest that this “confound” could be avoided by modifying the scenario so that the agent in all three conditions considers two options and opts for only one, for no disclosed reason. Of course, that is exactly what we did in Studies 4, 5, and 6b. In Study 4, participants in all conditions (A, B, and A/B) were told that multiple FDA-approved blood pressure drugs exist, but Dr. Jones decides to give all of his patients “Drug A” (or “Drug B,” or to conduct an A/B test of Drugs A and B to see which helps his patients most and offer them that one going forward). Similarly, in Study 5 (replicated in a health provider sample in Study 6b), all participants in all conditions are told that some doctors in a walk-in clinic prescribe “Drug A” to all of their patients while others prescribe “Drug B” to all of their patients. Dr. Jones then decides—for no disclosed reason—to give all his hypertensive patients Drug A (or Drug B, or to conduct an A/B test).

However, MDS have two reasons for dismissing the results of these scenarios. First, they say, only in the A/B condition is Dr. Jones uncertain about how to treat his patients. This is partially true—and entirely irrelevant. It is true in the sense that Dr. Jones, as an individual, demonstrates uncertainty only in the A/B condition, as do all agents who embark upon an A/B test (although we prefer to describe them as having epistemic humility). But all of the walk-in clinic doctors collectively demonstrate uncertainty about the best treatment for their patients. Participants rating all conditions are told that some doctors prescribe Drug A to all of their patients, while others prescribe Drug B to all of their patients—and these patients walk in off the street and are heretofore unknown to the doctors, so it is not the case that either patients or doctors are selecting one another in ways that would make a particular drug the more sensible choice. Hence, this is a case, as we noted in PNAS, of unjustified variation in medical practices: the very common situation in which providers choose and assign to all their patients one treatment according to accidents of how they happened to have been trained, or which conferences they happen to have attended, and so on. The fact that experts and other agents are punished for having the epistemic humility to conduct an A/B test while agents who use their intuition to pick one policy for everyone are rewarded for being (overly) confident is a fascinating mechanism of the A/B Effect (we call it a proxy version of the illusion of knowledge in the PNAS paper), not a reason to dismiss evidence of the effect.

Second, MDS say that in the Best Drug scenarios, only in the A/B condition is Dr. Jones “doing something potentially illegal (running a medical trial without oversight).” First, although all of our scenarios were silent about things like consent, notice, and IRB oversight, some of our participants assumed (not unreasonably) that the agents in our vignettes had no intention of pursuing any of these things (we plan to conduct follow-up experiments in which we experimentally manipulate these aspects of the vignettes to isolate the extent to which they contribute to the effect).

As for whether Dr. Jones is doing something illegal, although pre-market drug (and device) trials are indeed heavily regulated by the FDA, this scenario involved drugs already approved by the FDA. In the U.S. (all our participants were U.S. residents), the primary source of regulation of Dr. Jones’s A/B test would be the federal regulations governing human subjects research (known as the Common Rule). The Common Rule only applies by federal law to human subjects research funded by one of several federal departments or agencies. Since Dr. Jones “thinks of two different ways to provide good treatment to his patients, so he decides to run an experiment,” it seems unlikely that he has an NIH grant in hand to support this activity. Most research institutions do apply the Common Rule as a matter of institutional policy to all human subjects research conducted there, but failure to seek IRB approval wouldn’t constitute an “illegal drug trial.” Moreover, the Common Rule only applies to activities “designed to develop or contribute to generalizable knowledge.” The Common Rule unhelpfully does not define this phrase, and IRBs can and do arrive at different interpretations. An IRB could find that all of the vignettes we constructed constituted (unregulated) quality improvement (QI) activities (where neither IRB approval nor consent is required) rather than (regulated) human subjects research. Randomization does not automatically make an activity human subjects research, and health systems can and do conduct “quality improvement RCTs” without IRB review or consent (e.g., Horwitz, Kuznetsova & Jones, 2019). Even when an IRB determines that an activity constitutes research, the Common Rule permits waivers for minimal risk research that would not otherwise be “practicable”—another undefined term that IRBs interpret differently, but one that can cover scenarios where notifying patients that they are in a trial would invalidate the results.

MDS acknowledge that they do not know whether Dr. Jones’s A/B test was illegal or not, and say that it is a problem that our participants probably don’t know, either. But to the extent that misconceptions about when things like consent and oversight are and are not legally (or ethically) required make randomized evaluation harder, that is one more mechanism of the effect we are studying.


7. Observations on MDS’s presentation of evidence

Finally, we remark on some curiosities in the way MDS present their case against our PNAS paper. First, they focus exclusively on the seven scenarios we tested in our Study 3, dismissing the ones from our Studies 1, 2, 4, 5, and 6 as irrelevant for idiosyncratic reasons. This means they are dismissing the totality of our results purely on the basis of an empirical exercise they conduct on less than half of our experiments—coincidentally, the ones in which we found the smallest effects (including one we reported as a null effect). In their PNAS Letter, they justify this choice by claiming that our Study 3 is the most similar to the studies they conducted for their in-press Marketing Science paper. But mere similarity to one’s own work is not a scientific rationale for selecting what part of someone else’s work can stand in for the whole. As we show in our forthcoming PNAS Reply, replacing just 3 of the 7 scenarios MDS chose to present with 3 of the ones they left out renders the evidence for their own form of “experiment aversion” significant in our original PNAS data.

Second, MDS went to the trouble of recruiting nearly 100 MTurkers to provide evaluations of the 14 policies from our Study 3, but they didn’t take the easy extra step of having the same participants evaluate the A/B tests as well. That is, they stopped just short of carrying out the within-subjects experiment that would have definitively tested their claim.

Third, as far as we can tell, MDS did not preregister the hypotheses or the data-collection and analysis plans for their empirical exercise. (In our PNAS paper, except for Study 1, every pilot, main study, and replication we reported was pre-registered, and we reported every single study we had done on this topic.) This is especially odd because Uri Simonsohn is a proprietor of AsPredicted, a popular website and service for pre-registering studies.

Fourth, in the main text of their Data Colada blog post, MDS introduce their discussion of our Checklist and Best Drug scenarios (Studies 1, 2, 4, 5, and 6 from our PNAS paper) as follows: "Next, the discussion that did not fit in the PNAS letter: the concerns we had with the designs of the 2 scenarios we did not statistically re-analyze" (emphasis added). However, we recently noticed that point 4 in footnote 3 of their post contradicts this statement. That footnote implies that MDS carried out their empirical exercise using nine of our scenarios, not just the seven they report in their PNAS Letter and the main body of their post, and found data supporting our hypothesis rather than theirs. As shown below, that footnote states "... in what we refer to as the 8th and 9th scenarios in the PNAS paper, the only two where in our re-analyses, experiments were objected to more than to their worst individual policy" (emphasis added).



There is no other mention of those other “re-analyses” in their PNAS Letter, blog post, or corresponding OSF archive, so we are confused.


8. Conclusion

We regret that our dispute with MDS has resulted in adversarial blog posts and PNAS Letters. We have been fans of much of Simonsohn’s previous work and of his blog. We found MDS’s in-press paper intriguing, and we stated publicly that we looked forward to doing studies that might explain why they reached different conclusions from us.

After MDS’s Letter was sent to PNAS, and after we submitted to PNAS our requested Reply, we discussed with Simonsohn some alternatives to this public argument, such as writing a joint commentary, or editing each other’s letters to resolve as many disagreements as possible before publication, or even starting an adversarial collaboration (e.g., Mellers, Hertwig, & Kahneman, 2001) to collectively gather new data that might sort things out. We remain open to working with, or at least having constructive discussions with, MDS and anyone else interested in this topic in the future.


Notes

[1] PNAS first contacted us on August 20, 2019, to tell us that MDS had submitted a Letter and to request our response. Our first within-subjects experiment was preregistered on July 29 (materials testing preregistered on July 26). In that experiment, we showed participants all three conditions from the Meyer et al. Checklist scenario and asked them to rate each one—but one at a time (in randomized order). After they had sequentially rated each of the director’s three choices, we invited them to revise their ratings now that they had seen all three options. The results of this “Checklist—sequential” experiment support both the A/B Effect and the more narrowly defined experiment aversion, but we plan to report them in the Supplemental Information that will eventually accompany the preprint. Next, on August 12, 2019, we conducted a within-subjects experiment of Checklist in which participants were simultaneously shown and asked to rate all three decisions, presented in randomized order (materials testing preregistered on August 8). These results are reported in our preprint. On September 2, we preregistered a “simultaneous” within-subjects test of the Best Drug: Walk-In scenario from Meyer et al. (materials testing preregistered on September 1), also reported in the preprint. Finally, on October 15, we preregistered “simultaneous” within-subjects tests of three Study 3 scenarios from Meyer et al.: Direct-to-Consumer Genetic Testing, Autonomous Vehicles, and Retirement Savings. We preregistered and conducted materials testing for all three scenarios on October 14, but to date, we have only conducted a full experiment with the DTC Genetic Testing scenario, reported in the preprint. We will conduct the full experiments with Autonomous Vehicles and Retirement Savings soon.

[2] In these within-subjects experiments, we also asked participants to rank the agent’s three options (A, B, and A/B test), and in each of the three scenarios we tested, we find large proportions of participants who rate the A/B test as the decision-maker’s worst option. We observe this even though an A/B test necessarily assigns only 50% of people to the rater’s least-preferred policy—which ought to be superior to the option of assigning 100% of people to the rater’s least-preferred policy, if people’s dislike of an experiment only ever reflects their dislike of the worst policy it contains.

[3] We take the main thrust of MDS's objection to Checklist to be the alleged "confound," in which A/B raters, but not policy raters, are confronted with "causing deaths for a trivial benefit (e.g., saving cost)." For the sake of comprehensiveness, we note that in the real world, especially in the non-profit world of hospitals and health systems that operate on the slimmest of margins, resources are often extremely constrained. Reprinting and redistributing all provider badges—or printing and hanging posters—is not a "trivial" decision, especially if scaled across a large system. Moreover, even assuming a magical world in which money is no object, more interventions are not always superior to fewer interventions. Finally, two or more treatments are often mutually exclusive (as is the case with blood pressure Drug A and Drug B in our Best Drug scenarios).

Thursday, September 8, 2016

Why Colleges And Universities Should Not Disinvite Speakers

Since 2000, more than 140 people who have been invited to speak on American college and university campuses have been “disinvited” before they could give their talks, usually after objections from students. It is easy to find news accounts of these events, or non-events: this handy database details (as of this writing) 342 successful and unsuccessful disinvitation campaigns. One of the most prominent recent examples may be the New York University administration's decision in September 2016 to cancel a lecture by the Nobel prizewinning biologist James Watson, on strategies for curing cancer, six days before it was scheduled to happen, because of student complaints about statements Watson had made on other topics in the past. Later in the same academic year, Watson was also disinvited from giving a talk on the same topic at the University of Illinois.

Notably, these events were not reported in any mainstream media outlets, which mostly seem to regard suppressing lectures as normal business in academia today. Indeed, campus groups are less likely to invite controversial speakers in the first place, given how likely it is that such invitations will meet with opposition and possible cancellation. It is much harder to find stories about academic leaders who clearly rejected demands for disinvitation and clearly explained why. If I were a college president, and a campaign to disinvite a speaker arose on my campus, here is the letter I would write.

To The Campus Community:

Recently, an organization on our campus announced that a particular person has been invited to speak here. Many students, some faculty, and a few alumni of our institution have publicly objected to the invitation of this speaker. Some have demanded that his invitation be rescinded, so that he will not be able to use our “platform,” and the imprimatur of our college will not be attached to the controversial things he was expected to say—or to anything he has said or written in the past. It has been alleged that the speaker will make some students feel uncomfortable or unsafe, that his beliefs are repugnant, and that his ideas are not rational or grounded in solid evidence.

I have heard these demands, and listened to the arguments of their supporters. I am writing to say that I do not agree with them, to announce that the speaker’s talk will go forward as planned, and to explain why.

First, let’s be clear that this is not a matter of “freedom of speech” in the legal sense. The First Amendment to the U.S. Constitution, as interpreted by the courts, says that governments may not abridge freedom of speech, but it puts no restrictions on private institutions like our college. Disinviting someone is unprofessional and rude, but we have the legal right to disinvite—or to not invite in the first place—anyone we please.

Some will say that if I do not disinvite this speaker, I am therefore supporting him. This is not the case. There is a clear logical distinction between endorsing a person’s claims and beliefs, and giving him an opportunity to express those claims and beliefs. There are many speakers who have come to our campus with whom I disagree, but I did not block them. We do not ban books from our library or websites from our network, so we do not ban speakers from our grounds. Constitutional law does allow for some non-commercial speech, such as explicit incitements to violence or disorder, to be punished by the state. And I would certainly agree to interrupt or block a speech that was directly threatening anyone’s immediate physical safety. But beyond that, it is not my place to decide who is and is not permitted to speak here.

In fact, no one’s personal preferences should have anything to do with this question. To see why, we must consider what our college, or any college, is here for. An institution of higher education is organized around the concept of learning. Learning is why we are all part of this community. Students are here to learn about the arts, sciences, and other disciplines they pursue. Professors are here to learn as well: to learn entirely new things about the natural, social, and humanistic worlds, and to learn how to teach more effectively. The staff and administration are here to make these endeavors possible.

Learning doesn’t just mean going to class, doing homework, and taking exams. If a group of students or faculty members are so interested in hearing, debating, and engaging with the ideas of a person from outside our community that they decide to invite him here and organize and attend an event, I cannot rebuke them. In fact, I congratulate them, for they are engaged in an act of learning that goes beyond what is strictly required of them. They are spending their personal time and energy furthering the central purpose of our institution. Even though I may not like all the speakers they are selecting, I still love the fact that they are bothering to select speakers at all.

The reputation of our college and the value of the degrees we confer will not be affected by the speakers we host, but they will suffer if we acquire a reputation for stifling unpopular views. A college does not need to “manage” its “brand,” and it should not act like a for-profit company in this respect. All colleges stand for excellence in scholarship; that is the only brand that matters, and disinviting speakers and suppressing thoughts will only cheapen it.

Even speakers who espouse ideas you find dangerous and are sure you would never accept—like a creationist, a 9/11 “truther,” a genocide denier, or someone who argues that “rape culture” is a liberal myth—may be worth hearing. If you find out what they claim for evidence, and the kinds of words, phrases, and arguments they use, you can better rebut them yourself—whether you are reasoning with other people, or questioning your own beliefs. Listen to the other side's case in order to strengthen your own. In other words, know thy enemy.

To those who say the speaker may make them feel unsafe, I must point out that higher education is not designed to make people safe. Instead, it is our society’s designated “safe space” for disruptive intellectual activity. It’s a space that has been created and set apart specifically for the incubation of knowledge, by both students and faculty. Ideas that may seem dangerous or repugnant can be expressed here—even if nowhere else—so that they can be analyzed, discussed, and understood as dispassionately as possible. Many of humanity’s greatest achievements originated as ideas that were suppressed from the public sphere. Some, like the theory of evolution by natural selection, equal rights for women and minorities, trade unions, democracy, and ironically even the right to free speech and expression, are still seen as dangerous decades and centuries later.

If you are against this speaker coming here, please also consider this: Some members of our community—some of your friends and colleagues—do want him to visit. By asking me to disinvite him, you are implicitly claiming that your concerns and preferences are more important than those of the people who invited him. Are you really sure that you are so right and they are so wrong? Psychologists have found that people tend to be overconfident in their beliefs, and poor at taking the perspective of others. That might be the case here.

A decision by me to bar this speaker would have far-reaching negative repercussions. It would make everyone in our community think twice before they stage a provocative event or invite a controversial speaker. Canceling this invitation would not only prevent this person from talking; it would reduce the expression of views like his in the future, and probably chill speech by anyone who could be regarded as controversial. And it would set a precedent that future leaders in higher education may point to if they feel pressured to do the same. All of this would be antithetical to our common purpose—and our institution's social function—of learning and discovery.

Note that it’s especially important for us to be open to viewpoints not already well-represented among our faculty. The professors here are a diverse group, but many studies have shown that professors tend to be more politically left-wing than the population at large. Even the most conscientious instructor may inadvertently slant his teaching and assignments towards his own political viewpoint. Of course, this applies more in the social sciences and humanities than in math or physics, but it does happen. Giving campus organizations wide latitude to invite the speakers they wish helps to increase the range of thoughts that are aired and discussed here.

If you feel that this speaker’s talk might upset you, I offer this advice: Go. Yes, go to the talk, listen to it, record it—if the speaker and hosts give permission—and think about it. Expose yourself to ideas that trouble you, because avoiding sources of anxiety is not the best way to cope with them. When you encounter troubling ideas on our campus, try to desensitize yourself to emotional reactions by keeping in mind that ideas themselves cannot hurt you.

And please do not try to distract, interrupt, or shout down the speaker. "Deplatforming" a speaker risks making him into a "free speech martyr" who will attract more followers because he is seen as a teller of truths so dangerous that his opponents try to ban them. Just ignore him, rather than spark an explosion that might win him new fans or deepen the ardor of the ones he already has. It is a natural impulse for us to suppress speech that we don’t like, just as it is natural for us to retaliate against or outlaw behavior we don’t like. That’s why we have laws to protect unpopular speech and institutions to foster and study it. You can take this golden opportunity to train yourself to respond to speech that upsets you by listening to it, analyzing it, looking up its sources, developing reasoned counterarguments to it, and considering why people agree with it and whether it might not be as contemptible as you have been told. These are the intellectual skills that all members of our community are committed to building.

In fact, if you’re already committed to everything this speaker is against, then you should definitely listen to him. John Stuart Mill wrote, “He who knows only his own side of the case, knows little of that.” When you never encounter people who vigorously argue for positions you don’t agree with, you may come to believe that those arguments don’t have merit, don’t deserve to be heard, or don’t even exist. The argument you imagine your opponents making is probably weaker and easier to dismiss than the argument they would actually make if they had the chance.

Of course, you don’t have to listen to speakers you disagree with. That’s the beauty of our system: We are all committed to the broad goal of learning, but we are never forced to attend to people we can’t stand. If you want to protest this speaker, do so peacefully, outside the venue, and do not block anyone from attending. Hand out fliers or arrange for other speakers to present counterarguments or different ideas. As Justice Louis Brandeis said, “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the process of education, the remedy to be applied is more speech, not enforced silence.” And if just being in the speaker’s presence will cause too much discomfort, you may avail yourself of the truly safe space of your dorm room or apartment, or the company of other like-minded students.

Please be careful, though, about making a habit of avoiding or trying to suppress uncomfortable ideas. In the wider world there are no spaces where you can be safe from the thoughts in other people’s heads, so if people are stereotyping you or otherwise judging you unfairly, nothing that restricts speech here on our campus will solve that problem. Holding negative thoughts and uttering negative speech are a part of human nature that our college does not exist to protect you from. On the contrary, we exist to arm you with the intellectual tools to understand, analyze, and dispute incorrect ideas. Shutting off those ideas does nothing to inoculate you against them, and may ironically make you even more vulnerable in the future.

The same is true of our college—of our community—as a whole. Once an organization stops challenging itself with ideas, old or new, it becomes intellectually flaccid and surrenders any claim to scholarly excellence. Valuing comfort and community over openness to ideas is perfectly fine for many organizations. Religions, charitable causes, and political parties are free and sometimes even wise to exclude ideas and people that they disagree with. But the essential common value in a university is free inquiry for the purpose of learning, and by joining the university we have all sacrificed our right to be safe from ideas we disagree with. Community is important here, but openness is fundamental. It’s for that simple reason that I will not disinvite any speaker who has been legitimately invited to talk to us.

Ruth Simmons, the former president of Brown University, told the graduating Smith College class of 2014, “The collision of views and ideologies is in the DNA of the academic enterprise. We don't need any collision-avoidance technology here.” I could not agree more. Therefore, as the leader of this community of scholars, of our academic enterprise, I would be doing the opposite of my duty were I to force silence on this or any other speaker. I hereby decline the requests to disinvite him. And I encourage all campus groups and organizations to invite the speakers they want to hear, knowing that I will respect and support your efforts to learn and engage with their ideas.

Sincerely,

Your College President


NOTE: This was originally published on 8 September 2016 and was revised and updated on 4 June 2017 and 7 April 2018. If you represent a mainstream online or print publication and would like to publish a version of this essay, please contact me.