Getting Clear on the Replication Crisis

22 January 2019

Ever since the replication crisis broke around 2011, a number of causes—some more nefarious than others—have been identified for why a psychological experiment (or an experiment in another field) might not replicate.

For those who missed it, the replication crisis is the discovery in psychology that many of the field’s apparently important results don’t replicate. That is, when independent researchers ran the same experiments as ones already published, the previously found results failed to materialize. The crisis owes much of its prominence to a replication effort led by Brian Nosek of the University of Virginia. He and some 270 other researchers redid 100 of the most prominent studies published in 2008. They found that only 30–40% of the redone experiments produced the same results as the originals (depending on which statistical test was applied to the data).

When it comes to the root causes of the crisis, the most important problem (out of several) seems to have been a bias on the part of journals for publishing eye-popping results, combined with a too-lenient standard of statistical significance. In the past, experimental data only had to pass the significance test of p < 0.05 in order to be publishable. If an experiment’s apparent effect isn’t real, data that pass this test would turn up by chance only 5% of the time (or less). That might seem good. But think of accidentally passing that test for statistical significance as rolling a 19 on a 20-sided die; even if your experiment isn’t meaningful, you might have gotten lucky and rolled a 19. So with many thousands of studies being done each year, the consequence was that hundreds of studies hit that level of significance by chance. And many of those hundreds of studies also had eye-popping “results,” so they got published. There are several easy-to-follow overviews of the replication crisis available (e.g., here and here), so I won’t expand on the general situation.
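To make the die-rolling picture concrete, here is a minimal simulation of my own (not from the post or from any of the studies it discusses): it runs a large batch of experiments in which there is no real effect and counts how many clear the p < 0.05 bar anyway. The number of studies and the group sizes are assumptions chosen purely for illustration.

```python
# Illustrative sketch: how many no-effect experiments pass p < 0.05 by luck alone.
# All numbers are hypothetical; each "study" is a simple two-group t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 10_000     # assumed number of studies run in a year
n_per_group = 30       # assumed participants per condition

false_positives = 0
for _ in range(n_studies):
    # Both groups are drawn from the same distribution, so any "effect" is luck.
    group_a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    group_b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_studies} no-effect studies reached p < 0.05")
# Expect roughly 500, i.e. about 5%: these are the rolls that came up 19.
```

Out of ten thousand such null experiments, around five hundred will look “significant,” and the most eye-popping of those were the ones most likely to reach print.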

Rather, I want to go into a different possible reason why a study might fail to replicate, one that seems to have been mostly overlooked, namely: lack of conceptual clarity about the phenomenon being measured. This could be lack of clarity in the instructions given to participants or even in the background thinking of the experimenters themselves.

One of the more famous studies that didn’t replicate was published in Science in 2012. Will Gervais and Ara Norenzayan did five experiments that seemed to show that priming participants to think in analytic ways lowered their levels of religious belief. For example, getting participants to do word problems in which analytic thinking was needed to override intuitions apparently lowered their level of belief in supernatural agents, like ghosts and spirits. Gervais and Norenzayan even found that showing participants images of Rodin’s sculpture The Thinker (pictured above) lowered measured levels of religious belief. The topic of such studies is exciting enough to get published in a top journal: if this were a real phenomenon, then just getting people to think a bit harder would start to erode their religious belief.

Too bad attempts to replicate Gervais and Norenzayan’s findings came up empty-handed. Now, to be fair, there is a real phenomenon in the ballpark that has turned up in various studies: people who have a more analytic cognitive style in general—meaning they’re prone to using logically explicit reasoning to overrule intuitions—are less religious. But that’s a finding about a general personality trait, not the effect of in-the-moment priming.

Applying my suggestion to Gervais and Norenzayan’s studies, we can ask the following: how might lack of conceptual clarity on the part of participants help engender a situation that’s ripe for replication failure? There are two things to say. First, lack of conceptual clarity may lead to a situation in which there is no determinate phenomenon being probed. If that’s the case, then a given experiment, though it might seem to be about a determinate topic, is really just another roll of the 20-sided die. But that roll might come up 19, so to speak, in which case it could well get published. But second—and more subtly—in the presence of conceptual unclarity on the part of participants, irrelevant influences (like idiosyncrasies of a particular experimental location) have a better chance to sway participants’ answers, since participants don’t have a clear grasp of what’s being asked. Irrelevant influences mean that factors about a particular experimental situation prompt participants to answer one way rather than another, even though those factors have nothing to do with the topic being researched. If that happens, then when the study is replicated in a situation where those irrelevant influences are absent, the apparent effect is likely to disappear.

Take Gervais and Norenzayan’s Study 2, for example—the one in which participants looked at an image of Rodin’s The Thinker and then reported on their level of religious belief. Gervais and Norenzayan found that their participants who saw an image of The Thinker reported lower levels of religious belief than their participants who saw images not associated with analytic thinking. But what was their measure of level of religious belief? They write that “a sample of Canadian undergraduates rated their belief in God (from 0 to 100)” after being randomly assigned to view four images either of artwork depicting a reflective thinking pose or of control artwork matched for surface characteristics.

Now ask yourself: what does a 56 point as opposed to a 67 point rating of belief in God (out of 100) even mean? There are various things that might come to mind when such a rating is requested. One might take oneself to be rating how confident one feels that God exists; that means giving an epistemic rating to the belief. Or one might take oneself to be rating how central “believing” in God is to one’s identity; this means giving a social rating to the belief. And these two things come apart. Many devout Christians, for example, admit in private that they wrestle with doubts about God’s existence (low epistemic confidence), even as they dutifully attend church every week (high degree of social identity). And without clarity about what their answers even mean, participants might well be pushed and pulled by irrelevant contextual factors that create some sort of response bias (the placement of response buttons on their keyboards, or whatever).

It is, I confess, only a guess to say that lack of conceptual clarity on the part of the participants was implicated in the not-to-be-replicated data in Gervais and Norenzayan’s original study. Furthermore, measures can often be meaningful even when—or sometimes especially when—participants have no idea what’s being measured. But it’s fair to say that in at least some cases, lack of clarity on the part of participants about what’s being asked of them can lead to confusion, and this confusion opens the door for irrelevant influences.

The good news about the unfolding of this crisis so far is that many researchers in psychology got their acts together, in three main ways. First, researchers now use more stringent statistical tests to determine whether their results are worth publishing. Second, pre-registering studies has become standard. Pre-registration means that researchers write down in advance, before running the experiment, what their methods and analyses will be, and submit that document to a third-party repository. Pre-registration is useful because it helps prevent p-hacking and data-peeking. And third, it’s become more common to attempt to replicate findings before submitting results to a scientific journal. That means that a given experiment (or one with variations) gets done more than once, and the results need to show up in each iteration. Thus, researchers are checking themselves to make sure they didn’t get single-experiment data that just happen to pass tests for statistical significance by accident.
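For readers curious about what data-peeking actually does to the error rate, here is a rough simulation of my own (an assumption-laden sketch, not anything from the post or the cited replications): the simulated experimenter runs a test after every ten participants per group and stops as soon as p < 0.05, and that optional stopping pushes the false-positive rate well above the nominal 5%, which is part of what pre-registration is meant to prevent.

```python
# Illustrative sketch of data-peeking (optional stopping) inflating false positives.
# Design and numbers are hypothetical; both groups are identical, so the null is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_simulations = 2_000
max_n = 100          # assumed maximum participants per group
peek_every = 10      # assumed peek interval: test after every 10 per group

early_stops = 0
for _ in range(n_simulations):
    group_a = rng.normal(loc=0.0, scale=1.0, size=max_n)
    group_b = rng.normal(loc=0.0, scale=1.0, size=max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        _, p_value = stats.ttest_ind(group_a[:n], group_b[:n])
        if p_value < 0.05:
            # Stop collecting data and declare a "finding" at the first lucky peek.
            early_stops += 1
            break

print(f"False-positive rate with peeking: {early_stops / n_simulations:.1%}")
# Typically prints a rate in the high teens rather than the nominal 5%.
```

A pre-registered analysis plan rules this out by fixing the sample size and the statistical test in advance, before any data have been seen.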

The moral of the present discussion is that there is additional room for improvement. Sometimes a philosophical question needs to be asked in the process of experimental design. If you’re an experimenter who wants to ask participants to give responses of type X, you would do well to ask yourself, “What does a response of type X even mean?” If you’re not clear about this, your participants probably won’t be either.

Comments (1)



Harold G. Neuman

Wednesday, January 23, 2019 -- 2:12 PM


Undoubtedly, there is good reason for any science-based research to show that its results are replicable; this is well laid out in your post, and the point has been around elsewhere for a long time. I suppose this illustrates, as has been mentioned before, the interdisciplinary relationship among philosophy; psychology; psychoanalysis; physiology; and neuroscience. I know relatively little about philosophy and even less about psychology, but for me at least, Philosophy Talk is mostly about philosophy, which is why I like it and the lively engagement it encourages. If the main purpose of the blog is to promote philosophical thought and discourse, then let’s keep it going. If psychology deserves a blog of its own, why not set one up? You can call me a purist or an idiot if you like, but I wouldn’t be especially interested in this blog if it were called Psychology Talk. Just sayin'...