Thursday, November 7, 2019

For Better or Worse, the A/B Effect Is Real

1. Introduction

Earlier this year we published a set of studies in PNAS (open access article; Supplemental Information; OSF preregistrations, data, and code) on what we called the “A/B Effect,” which we defined as occurring when people “judge a randomized experiment comparing two unobjectionable policies or treatments (A and B), neither of which was known to be superior, as less appropriate than simply implementing either A or B for everyone” (Meyer et al., 2019, p.102074). More precisely, we found that when separate groups of participants rated the appropriateness of policy A (for example, giving teachers a yearly bonus), policy B (e.g., giving teachers more vacation days), or an A/B test with the explicit goal of learning which policy leads to the best outcome (and then making that the policy for everyone going forward), the A/B test received lower mean appropriateness ratings, and more participants explicitly rated it as inappropriate, than either A or B. We observed this effect to varying degrees when people evaluated vignettes in several domains, including health care, humanitarian aid, education, tech, and “corporate” experimentation. We concluded that policymakers might face more backlash from running A/B tests to determine the effects of their proposed policies than simply using their intuition to pick and impose a policy on everyone—a possibility that we, as fans of evidence-based decision-making, find lamentable.

Today, Berkeley Dietvorst, Robert Mislavsky, and Uri Simonsohn have posted on Simonsohn's Data Colada blog a critique of our paper. Their post is an expanded version of a Letter by Mislavsky, Dietvorst, and Simonsohn (henceforth MDS) published yesterday in PNAS (our Reply will appear in PNAS soon) (UPDATE: our Reply was published in PNAS on 12 November 2019). MDS claim that “experiment aversion” (a term they use to refer to the A/B Effect we described, but to which they attach a narrower meaning, a point we elaborate on below) can be explained away as the result of a statistical artifact, and is therefore not actually occurring in our studies. We lay out this argument in Section 2 below, but it boils down to a claim (for which they report evidence in their own paper, in press at Marketing Science) that people’s rating of an experiment is never lower than their rating of the worst policy that the experiment contains. Because our PNAS studies were all between-subjects, where participants rated A or B or the A/B test, but not all three, we can only compare one group’s ratings of the A/B test to the ratings of A and B made by the other two groups of participants.

In designing our studies, we made a deliberate decision to conduct between-subjects experiments. In the real world, people are rarely given the opportunity to evaluate option A, option B, and a test comparing these. When a policy is implemented, it is rare that rejected alternative policies—or policies the decision-maker never thought of—are announced alongside of it. Similarly, when an A/B test comparing two policies is announced, it is virtually never pointed out that the decision-maker could have chosen to impose A or B on everyone. The world, in other words, is a between-subjects experiment, where people rarely realize or contemplate the fact that they are, in effect, subject to policies and decisions that make them participants in any number of uncontrolled experiments. To choose anything other than a between-subjects design, therefore, would have sacrificed the external validity of our studies.

We did think, however, that alerting raters of A/B tests to the fact that agents often have the power to simply impose either untested policy on everyone might improve perceptions of experiments and/or that alerting raters of policies to foregone alternatives might make people more skeptical of policies. Either (or both) would have the effect of reducing the A/B Effect, and mitigating people’s objection to experimentation is one of the ultimate goals of our research program. To that end, we had long planned to run within-subjects experiments, some of which we have now conducted and reported in a preprint and describe below. In all three of the scenarios from Meyer et al. (2019) that we have so far tested in a fully within-subjects design, including one from the Study 3 scenarios that MDS prefer, we find robust evidence not only of the A/B Effect, as we defined it (AB < mean(A,B)), but also of increasingly conservative forms of “experiment aversion,” including MDS’s definition (AB < min(A,B)) and an even more conservative definition, which requires individuals to rate the A/B test as inappropriate while also not objecting to either policy.

In this essay, we analyze MDS’s arguments and explain why there is strong evidence for both the A/B Effect and MDS’s narrower “experiment aversion,” both in our new within-subjects experiments and in the studies we reported in PNAS (evidence MDS ignore in their blog post and PNAS Letter). We end by addressing MDS’s objections to our Checklist and Best Drug scenarios and making some observations on their research and reporting practices.

2. The mathematical fact behind MDS’s critique

The core of MDS’s argument is a simple mathematical fact: given pairs of samples from two distributions with different means, the mean of the minimums must be less than or equal to the minimum of the means (in math, mean(min(A_i,B_i)) <= min(mean(A_i),mean(B_i)), where A_i and B_i are the ith respondent’s ratings of A and B respectively). In the context of our studies, if each participant is asked to rate their approval for two different policies, taking the lower rating of each person’s pair of ratings, and then taking the average of these lower ratings, is guaranteed to be less than or equal to the result of first averaging over all participants for each policy and then taking the minimum of these means.
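The inequality is easy to check numerically. The following is a minimal sketch using made-up paired ratings; the 1-to-5 numeric coding, the data, and the helper names are illustrative assumptions, not values or code from any of the studies discussed here.

```python
from statistics import mean

def mean_of_mins(a, b):
    """Average, over respondents, of each respondent's lower rating."""
    return mean(min(x, y) for x, y in zip(a, b))

def min_of_means(a, b):
    """Lower of the two policy-level average ratings."""
    return min(mean(a), mean(b))

# Hypothetical paired ratings (assumed coding: 1 = very inappropriate,
# 5 = very appropriate). Element i of each list is respondent i.
ratings_a = [5, 4, 2, 5, 3]
ratings_b = [3, 5, 4, 2, 4]

print("mean of minimums:", mean_of_mins(ratings_a, ratings_b))
print("minimum of means:", min_of_means(ratings_a, ratings_b))

# The first quantity can never exceed the second, whatever the data.
assert mean_of_mins(ratings_a, ratings_b) <= min_of_means(ratings_a, ratings_b)
```

The two quantities are equal only when the same policy is the lower-rated one (or tied) for every respondent; any disagreement across respondents about which policy is worse pulls the mean of the minimums strictly below the minimum of the means.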

We have no argument with this mathematical observation. In theory, it could account for what MDS call “experiment aversion.” When people rate experiments comparing two policies, A and B, as less appropriate than implementing either Policy A or Policy B alone for everyone, it could just be that in the experimental condition they see both policies and their rating of the experiment is simply the lower of their ratings of the two policies. That is, people’s rating of the A/B test could be equal to their rating of the policy (A or B) that they, personally, like the least. In essence, this is what MDS claim to have found in their own in-press paper.

Finally, we acknowledge that in most of our scenarios we did not directly test for this possibility. As noted earlier, we adopted a between-subjects design; thus, in most of our scenarios, only the participants in the A/B condition saw both policies. We therefore compared the mean of the A/B condition with the minimum (and mean) of the A and B means, but were not able to compare the A/B condition to the mean of the minimums.

In other words, we agree that MDS have proposed a hypothesis that could explain why participants often rate A/B tests less favorably than universally implementing either policy A or B.

3. MDS draw inappropriate conclusions from this fact

If all they had done were to propose an additional possible explanation for the A/B Effect findings in our Study 3, we would not object. However, their critiques go further, claiming via simulations that the mathematical fact they rely on completely accounts for all our findings, rendering our claim about the existence of an A/B Effect invalid.

Their empirical exercise combines a small amount of new data (99 MTurkers judging the appropriateness of the A and B policies from all seven experiments in Study 3 of our PNAS paper) with some of our data (our participants’ evaluations of the seven A/B test and policy conditions from the same study) to simulate what would have happened if our participants had evaluated not just the A/B test, but also the A and B policies. In effect, they are simulating the outcome of within-subjects experiments and then claiming to have found no evidence that such experiments would find the A/B test to be judged worse than the worst policy. Their simulation assumes, however, that they can infer how our participants who saw only the A/B test would have rated the A and B policies, had they also seen them. This assumption is unwarranted, as our within-subjects experiments suggest (see Section 4, below); A/B tests are sometimes rated differently when the raters are explicitly told that A and B could have been implemented unilaterally, than when they are only told about the A/B test.

There are other problems with MDS’s second, more expansive claim. First, even if it were true that lower ratings for experiments are always fully explained by individuals’ least preferred policy, it simply does not follow that people never object to experiments comparing unobjectionable policies. (This claim is made in the subtitle of MDS’s in-press Marketing Science paper, which is “People don’t dislike a corporate experiment more than they dislike its worst condition.”) This first problem amounts to a difference in definitions. MDS present their finding (that they fixed a “problem with the statistical analysis” in our paper) as reflecting a mathematical fact. However, their conclusion depends on a particular definition of experiment aversion, in which an individual participant presented with all three possibilities (A, B, and the A/B test) ranks the A/B test the lowest. Although this definition makes sense in light of MDS's objectives, as explained above, it does not make much sense in the real world.

Put another way, if any part of participants’ objections to an experiment derives from the fact that they, unlike their peers who only see A or B, have been made aware that alternatives exist, that is a mechanism of the psychological phenomenon we reported. Rather than being a “confound,” we view the nature and extent of this mechanism as an important empirical question. Presumably MDS feel differently, but that is a difference of opinion regarding which phenomena are interesting to study, and not at all a “problem with the statistical analysis” on our part that they “corrected.”

Second, it is not true—both in the studies that we have conducted since (see Section 4) and in the studies we have already published (see Section 5)—that people object to experiments no more than they object to their least-preferred of the policies the experiment contains. To the contrary, people often rate experiments as significantly worse than the policy they personally regard as worst (and explain their ratings by objecting explicitly to the experiment rather than to the treatments it compares).

4. Within-subjects evidence for the A/B Effect and “experiment aversion”

MDS propose that a true test of experiment aversion is one in which a participant’s rating of an A/B test can be compared directly to his or her rating of the least-preferred option (A or B). We have now run several of our own scenarios using a fully within-subjects design that allows this test to be run. The results of our within-subjects studies clearly support the existence of not only our A/B Effect but also MDS’s “experiment aversion.” Below, we explain some context and the timing of these experiments, followed by a short summary of the results. Interested readers can refer to our preprint.

We found evidence for an A/B Effect and for MDS’s definition of experiment aversion in three out of three high-powered, preregistered experiments where we adapted scenarios from Meyer et al. (2019) to a within-subjects design. We began these experiments before learning about MDS’s critique. [1] The scenarios we ran included Hospital Safety Checklist (where a hospital director chooses between placing safety reminder checklists on posters in procedure rooms or on doctors’ ID badges), Best Drug: Walk-In Clinic (where some doctors prescribe Drug A and others prescribe Drug B to all of their patients, effectively randomizing which drug patients receive when they walk into the clinic), and Consumer Genetic Testing (where a genetic testing company chooses to either return medically actionable genetic test results or all genetic health results, including those which customers can do nothing about). We chose these scenarios because they varied in domain (health, clinical medicine, or corporate), the decision-making agent (a hospital director, a doctor, or a CEO), and the size of the A/B Effect observed in the between-subjects versions.

The only difference between these experiments and those reported in our original paper was the design: here, participants rated all three options (Policy A, Policy B, and an A/B test, presented in counterbalanced order) on the same page. This design gives participants the option to rate the A/B test the same as their least-preferred policy (A or B), a pattern that should emerge if MDS’s hypothesis is correct. This is not what happened. We refer readers to our preprint for more details, but the results are described in brief below.

First, we can simply look at how many participants objected to each condition. Figure 1 shows the percentages of participants who objected to A, B, or the A/B test (by rating them as somewhat or very inappropriate). Whereas only ~10% of participants objected to each policy option across the three scenarios, more than one-third objected to the experiment—even when these participants had also read about the two unilateral policy options. This result also refutes the additivity claim that MDS make in their critique via a dessert-allergy analogy: in each scenario, the striped bar is substantially taller than the sum of the black bar and white bar. Participants in these studies felt it was more appropriate to assign A or B to everyone than to run an A/B test with the explicit purpose of learning which policy is most effective, even when they had complete information about all three options.

Figure 1 (from Heck et al., 2019)

Second, this design allowed us to test MDS’s proposed hypothesis directly, so we did. Following the exact summary of this test in their blog post, “Policies A and B were evaluated by the same people, and we compared the least acceptable policy for each participant to the acceptability of the experiment.” In other words, we took only the lower-rated policy between A and B for each participant, averaged over this measure, and compared it with the average rating of the experiment (computed over all participants). Figure 2 shows participants’ average ratings of Policy A (A symbols), Policy B (B symbols), the A/B test (X symbols), each participant’s average policy rating (black circles), and the measure of each participant’s least-preferred policy (black diamonds). Participants viewed the A/B test conditions as less appropriate than they viewed A, B, mean(A,B), and their least-preferred between the two (min(A,B)). By all measures and accounts, these results refute the premise that people don’t object to experiments.
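The comparison described above can be sketched in a few lines. This is an illustration only: the ratings are invented, the 1-to-5 coding is an assumption, and the function name is hypothetical. It is not the analysis code from the preprint, and a real analysis would add a paired significance test.

```python
from statistics import mean

# Hypothetical within-subjects data: one (A, B, AB) triple per participant,
# assumed coding 1 (very inappropriate) to 5 (very appropriate).
RATINGS = [(5, 4, 2), (4, 4, 4), (3, 5, 2), (5, 5, 5), (4, 3, 1), (4, 4, 3)]

def summarize(rows):
    """Return the five averages compared in the text for one scenario."""
    return {
        "A": mean(a for a, _, _ in rows),
        "B": mean(b for _, b, _ in rows),
        "AB": mean(ab for _, _, ab in rows),
        # Each participant's average policy rating, then averaged.
        "mean(A,B)": mean((a + b) / 2 for a, b, _ in rows),
        # Each participant's least-preferred policy rating, then averaged.
        "min(A,B)": mean(min(a, b) for a, b, _ in rows),
    }

summary = summarize(RATINGS)
for label, value in summary.items():
    print(f"{label:>9}: {value:.2f}")

# If ratings of the test merely tracked the least-preferred policy, AB would
# sit at about min(A,B); experiment aversion in MDS's sense is AB below it.
print("AB below least-preferred policy:", summary["AB"] < summary["min(A,B)"])
```

Because both the least-preferred-policy measure and the A/B rating come from the same participants, the comparison no longer relies on inferring how one group would have rated conditions that only another group saw.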

Figure 2 (from Heck et al., 2019)

Finally, a fully within-subjects design allows us to conduct an analysis that is even more restrictive than what MDS propose. We argue that it would be face-valid evidence for the A/B Effect if a participant explicitly objects to an A/B test (by rating it somewhat inappropriate or very inappropriate) while also choosing not to object to either Policy A or Policy B (by rating both as very appropriate, somewhat appropriate, or neither appropriate nor inappropriate). In each experiment, approximately 27% of participants did just this. In short, more than a quarter of participants demonstrated the strongest possible version of the A/B Effect by objecting to an A/B test while simultaneously choosing not to object to unilateral implementation of the untested policies it is designed to evaluate. [2]

5. Evidence from our PNAS studies that MDS ignore

Aside from our definitive within-subjects studies, there was already significant evidence in our PNAS studies to cast doubt on MDS’s claim that the A/B Effect wholly reduces to aversion to individuals’ least-favored policies.

First, in some of our scenarios, the percentage of participants who object to the A/B test exceeds the percentage who object to policy A plus the percentage who object to policy B. Even assuming zero overlap between those who dislike policy A and those who dislike policy B (a fairly implausible assumption), then, dislike of the individual policies cannot fully account for the experiment aversion we observe in those scenarios.

Second, as we note in our PNAS article, anticipating this concern about joint evaluation is what led us to run Studies 4 and 5 in Meyer et al. (2019), which provide good evidence that participants were objecting to experimentation and not merely to a disliked policy. In these studies, the A/B tests compared generic treatments—“Drug A” and “Drug B.” Unlike MDS’s hypothetical A/B test in which 30% of people (those with lactose intolerance) hate the dairy treatment, 30% of people (those with peanut allergies) hate the peanut treatment, and therefore 60% of people hate the A/B test, there is no rational basis for one group of raters to hate “Drug A” and another group to hate “Drug B,” yielding an A/B Effect that is driven entirely by individual raters’ least-preferred policies. And so MDS’s lactose-peanut hypothesis cannot explain the fact that, in our original and replication MTurk experiments, about 10% of people in each policy condition objected to the agent’s decision to prescribe Drug A (or B) for everyone, but about 35% of people in the A/B conditions objected.
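The arithmetic behind this argument can be made explicit. If objections to an A/B test came only from people who object to one of its policies, the objection rate for the test could not exceed the sum of the two policy objection rates (the zero-overlap case). The sketch below applies that bound to the approximate percentages quoted above and to MDS's dessert example; the function names are hypothetical.

```python
def max_share_explained(p_object_a, p_object_b):
    """Upper bound on A/B objections attributable to disliking A or B,
    reached only if no one objects to both policies."""
    return min(1.0, p_object_a + p_object_b)

def unexplained_share(p_object_a, p_object_b, p_object_ab):
    """Portion of A/B objections that the bound cannot account for."""
    return max(0.0, p_object_ab - max_share_explained(p_object_a, p_object_b))

# MDS's dessert example: 30% + 30% fully accounts for 60%.
print(unexplained_share(0.30, 0.30, 0.60))

# Approximate Best Drug figures from the text: about 10% object to each
# policy, about 35% object to the A/B test.
print(unexplained_share(0.10, 0.10, 0.35))
```

With the approximate figures from the text, roughly 15 percentage points of objections to the A/B test remain even under the most generous zero-overlap assumption.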

Finally, one obvious way to determine why people disapprove of experiments is to ask them. For this reason, in all of our experiments we asked our participants to tell us why they gave the ratings they gave. Of the 16 experiments reported in PNAS, we coded free responses from 4 (covering the Checklist and Best Drug: Walk-In scenarios), using two independent raters and a codebook of 28 codes. As we noted in the article (p. 10725), a fairly small minority of A/B test participants remarked that one policy was superior to the other or objected to unequal treatment of people, per se, and excluding those participants still yielded a large A/B Effect:

14% of participants in the A/B conditions of study 1 and its first replication [Checklist] commented that one policy was preferable or that the A/B test treated people unequally. Notably, excluding these participants still yields a substantial A/B effect in study 1, t(388) = 11.3, p < 0.001, d = 1.14, and its replication (study 2a), t(354) = 7.30, p < 0.001, d = 0.78.

Although in their letter to PNAS MDS claim to have invalidated our results based on the so-called “minimum mean paradox” described above, they then go on to raise other unrelated objections to the studies reported in our paper for which (as we show in our forthcoming reply in PNAS) their analysis fails to deliver their desired result. We turn to those objections now.

6. Responses to MDS's criticisms of our health domain designs

MDS dismiss the Checklist scenario as “confounded” because the A/B condition differs from the policy conditions not only in involving a randomized experiment, but also in a second way: Only in the A/B condition, they say, are raters confronted with a decision to forgo a potentially life-saving intervention for no disclosed reason, whereas in the policy conditions, “there is only one option available.” But this is not true. When, in the A condition, the hospital director decides to implement a Badge policy for everyone, he necessarily forgoes (without explanation) a Poster policy (and any number of alternative or additive policies), any one of which might have life-saving powers superior to those of Badge, either as a substitute for Badge or as an addition to it. The fact that policy implementers rarely disclose alternative or additional policies they have foregone does not mean those policies are not “available” or that the policy implementer’s decision “caused” deaths any less than did the director who ran an A/B test.

We agree entirely that our A/B test descriptions necessarily make extremely salient the fact that some people will be denied Badges and others will be denied Posters, whereas people rarely take away from an announcement of a universal Badge policy the consequence that everyone is being deprived of Posters. But that is the nature of how A/B tests and universal policies are announced, respectively, in the real world, and that real-world setting is precisely what we attempted to capture in our vignettes (see our previous comments about external validity). To the extent that A/B tests are perceived to deprive people of important interventions while decisions to choose between those same interventions in the form of a universal policy implementation are not so perceived, that is a mechanism of the A/B Effect, not a “confound” to be avoided. [3]

MDS suggest that this “confound” could be avoided by modifying the scenario so that the agent in all three conditions considers two options and opts for only one, for no disclosed reason. Of course, that is exactly what we did in Studies 4, 5, and 6b. In Study 4, participants in all conditions (A, B, and A/B) were told that multiple FDA-approved blood pressure drugs exist, but Dr. Jones decides to give all of his patients “Drug A” (or “Drug B,” or conduct an A/B test of Drugs A and B to see which helps his patients most and offer them that one going forward). Similarly, in Study 5 (replicated in a health provider sample in Study 6b), participants in all conditions are told that some doctors in a walk-in clinic prescribe “Drug A” to all of their patients while others prescribe “Drug B” to all of their patients. Dr. Jones then decides, for no disclosed reason, to give all his hypertensive patients Drug A (or Drug B, or to conduct an A/B test).

However, MDS have two reasons for dismissing the results of these scenarios. First, they say, only in the A/B condition is Dr. Jones uncertain about how to treat his patients. This is partially true—and entirely irrelevant. It is true in the sense that Dr. Jones, as an individual, demonstrates uncertainty only in the A/B condition, as do all agents who embark upon an A/B test (although we prefer to describe them as having epistemic humility). But all of the walk-in clinic doctors collectively demonstrate uncertainty about the best treatment for their patients. Participants rating all conditions are told that some doctors prescribe Drug A to all of their patients, while others prescribe Drug B to all of their patients—and these patients walk in off the street and are heretofore unknown to the doctors, so it is not the case that either patients or doctors are selecting one another in ways that would make a particular drug make more sense. Hence, this is a case, as we noted in PNAS, of unjustified variation in medical practices: the very common situation in which providers choose and assign to all their patients one treatment according to accidents of how they happened to have been trained, or which conferences they happen to have attended, and so on. The fact that experts and other agents are punished for having the epistemic humility to conduct an A/B test while agents who use their intuition to pick one policy for everyone are rewarded for being (overly) confident is a fascinating mechanism of the A/B Effect (we call it a proxy version of the illusion of knowledge in the PNAS paper), not a reason to dismiss evidence of the effect.

Second, MDS say that in the Best Drug scenarios, only in the A/B condition is Dr. Jones “doing something potentially illegal (running a medical trial without oversight).” First, although all of our scenarios were silent about things like consent, notice, and IRB oversight, some of our participants assumed (not unreasonably) that the agents in our vignettes had no intention of pursuing any of these things (we plan to conduct follow-up experiments in which we experimentally manipulate these aspects of the vignettes to isolate the extent to which they contribute to the effect).

As for whether Dr. Jones is doing something illegal, although pre-market drug (and device) trials are indeed heavily regulated by the FDA, this scenario involved drugs already approved by the FDA. In the U.S. (all our participants were U.S. residents), the primary source of regulation of Dr. Jones’s A/B test would be the federal regulations governing human subjects research (known as the Common Rule). The Common Rule only applies by federal law to human subjects research funded by one of several federal departments or agencies. Since Dr. Jones “thinks of two different ways to provide good treatment to his patients, so he decides to run an experiment,” it seems unlikely that he has an in-hand NIH grant to support this activity. Most research institutions do apply the Common Rule as a matter of institutional policy to all human subjects research conducted there, but failure to seek IRB approval wouldn’t constitute an “illegal drug trial.” Moreover, the Common Rule only applies to activities “designed to develop or contribute to generalizable knowledge.” The Common Rule unhelpfully does not define this phrase, and IRBs can and do arrive at different interpretations. An IRB could find that all of the vignettes we constructed constituted (unregulated) quality improvement (QI) activities (where neither IRB approval nor consent is required) rather than (regulated) human subjects research. Randomization does not automatically make an activity human subjects research, and health systems can and do conduct “quality improvement RCTs” without IRB review or consent (e.g., Horwitz, Kuznetsova & Jones, 2019). Even when an IRB determines that an activity constitutes research, the Common Rule permits waivers for minimal risk research that would not otherwise be “practicable” (another undefined term that IRBs interpret differently, but one that can include scenarios where notifying patients that they are in a trial would invalidate the results).

MDS acknowledge that they do not know whether Dr. Jones’s A/B test was illegal or not, and say that it is a problem that our participants probably don’t know, either. But to the extent that misconceptions about when things like consent and oversight are and are not legally (or ethically) required make randomized evaluation harder, then that is one more mechanism of the effect we are studying.

7. Observations on MDS’s presentation of evidence

Finally, we remark on some curiosities in the way MDS present their case against our PNAS paper. First, they focus exclusively on the seven scenarios we tested in our Study 3, dismissing the ones from our Studies 1, 2, 4, 5, and 6 as irrelevant for idiosyncratic reasons. This means they are dismissing the totality of our results purely on the basis of an empirical exercise they conduct on less than half of our experiments—coincidentally, the ones in which we found the smallest effects (including one we reported as a null effect). In their PNAS Letter, they justify this choice by claiming that our Study 3 is the most similar to the studies they conducted for their in-press Marketing Science paper. But mere similarity to one’s own work is not a scientific rationale for selecting what part of someone else’s work can stand in for the whole. As we show in our forthcoming PNAS Reply, replacing just 3 of the 7 scenarios MDS chose to present with 3 of the ones they left out renders the evidence for their own form of “experiment aversion” significant in our original PNAS data.

Second, MDS went to the trouble of recruiting nearly 100 MTurkers to provide evaluations of the 14 policies from our Study 3, but they didn’t take the easy extra step of having the same participants evaluate the A/B tests as well. That is, they stopped just short of carrying out the within-subjects experiment that would have definitively tested their claim.

Third, as far as we can tell, MDS did not preregister the hypotheses or the data-collection and analysis plans for their empirical exercise. (In our PNAS paper, except for Study 1, every pilot, main study, and replication we reported was pre-registered, and we reported every single study we had done on this topic.) This is especially odd because Uri Simonsohn is a proprietor of AsPredicted, a popular website and service for pre-registering studies.

Fourth, in the main text of their Data Colada blog post, MDS introduce their discussion of our Checklist and Best Drug scenarios (Studies 1, 2, 4, 5, and 6 from our PNAS paper) as follows: "Next, the discussion that did not fit in the PNAS letter: the concerns we had with the designs of the 2 scenarios we did not statistically re-analyze" (emphasis added). However, we recently noticed that point 4 in footnote 3 of their post contradicts this statement. That footnote implies that MDS carried out their empirical exercise using nine of our scenarios, not just the seven they report in their PNAS Letter and the main body of their post, and found data supporting our hypothesis rather than theirs. The footnote states "... in what we refer to as the 8th and 9th scenarios in the PNAS paper, the only two where in our re-analyses, experiments were objected to more than to their worst individual policy" (emphasis added).

There is no other mention of those other “re-analyses” in their PNAS Letter, blog post, or corresponding OSF archive, so we are confused.

8. Conclusion

We regret that our dispute with MDS has resulted in adversarial blog posts and PNAS Letters. We have been fans of much of Simonsohn’s previous work and of his blog. We found MDS’s in-press paper intriguing, and we stated publicly that we looked forward to doing studies that might explain why they reached different conclusions from us.

After MDS’s Letter was sent to PNAS, and after we submitted to PNAS our requested Reply, we discussed with Simonsohn some alternatives to this public argument, such as writing a joint commentary, or editing each other’s letters to resolve as many disagreements as possible before publication, or even starting an adversarial collaboration (e.g., Mellers, Hertwig, & Kahneman, 2001) to collectively gather new data that might sort things out. We remain open to working or at least having constructive discussions in the future with MDS and anyone interested in this topic.


[1] PNAS first contacted us on August 20, 2019, to tell us that MDS had submitted a Letter and to request our response. Our first within-subjects experiment was pre-registered on July 29 (materials testing preregistered on July 26). In that experiment, we showed and asked participants to rate all three conditions from the Meyer et al. Checklist scenario—but one at a time (in randomized order). After sequentially rating each of the director’s three choices, we invited them to revise their ratings after they were exposed to all three options. The results of this “Checklist—sequential” experiment support the A/B Effect and experiment aversion more narrowly, but we plan to report them in the Supplemental Information that will eventually accompany the preprint. Next, on August 12, 2019, we conducted a within-subjects experiment of Checklist in which participants were simultaneously shown and asked to rate all three decisions, presented in randomized order (materials testing preregistered on August 8). These results are reported in our preprint. On September 2, we preregistered a “simultaneous” within-subjects test of the Best Drug: Walk-In scenario from Meyer et al. (materials testing preregistered on September 1), also reported in the preprint. Finally, on October 15, we preregistered “simultaneous” within-subjects tests of three Study 3 scenarios from Meyer et al.: Direct-to-Consumer Genetic Testing, Autonomous Vehicles, and Retirement Savings. We preregistered and conducted materials testing for all three scenarios on October 14, but to date, we have only conducted a full experiment with the DTC Genetic Testing scenario, reported in the preprint. We will conduct the full experiments with Autonomous Vehicles and Retirement Savings soon.

[2] In these within-subjects experiments, we also asked participants to rank the agent’s three options (A, B, and A/B test) and in each of the three scenarios we tested, we find large portions of participants who rate the A/B test as the decision-maker’s worst option. We observe this, even though an A/B test necessarily assigns only 50% of people to the rater’s least-preferred policy—which ought to be superior to the option of assigning 100% of people to the rater’s least-preferred policy, if people’s dislike of experiments only ever reflects their dislike of the worst policy it contains.

[3] We take the main thrust of MDS's objection to Checklist to be the alleged "confound," in which A/B raters, but not policy raters, are confronted with "causing deaths for a trivial benefit (e.g., saving cost)." For the sake of comprehensiveness, we note that in the real world, especially in the non-profit world of hospitals and health systems that operate on the slimmest of margins, resources are often extremely constrained. Reprinting and redistributing all provider badges—or printing and hanging posters—is not a "trivial" decision, especially if scaled across a large system. Moreover, even assuming a magical world in which money is no object, more interventions are not always superior to fewer interventions. Finally, two or more treatments are often mutually exclusive (as is the case with blood pressure Drug A and Drug B in our Best Drug scenarios).