The Replication Crisis Nobody Told Your Psychiatrist About

A significant portion of foundational psychology research doesn't replicate. The studies your diagnosis rests on; the trials behind your medication's approval. What the replication crisis actually means for clinical practice, and why it hasn't changed anything yet.

In 2011, a psychologist named Daryl Bem published a paper in the Journal of Personality and Social Psychology claiming to demonstrate precognition. Nine experiments, over a thousand participants, statistically significant results suggesting that human beings could perceive the future. The paper passed peer review at one of the most prestigious journals in the field. It used the same methods that every other social psychologist was using. The statistics were clean by the field’s own standards.

The paper was, almost certainly, wrong. But the methods that produced it were not aberrant. They were standard. And that was the problem. If the field’s accepted methodology could produce evidence for psychic powers, what else had it produced evidence for that wasn’t real? The question detonated a decade of reckoning that has still not reached the offices where your psychiatrist sits, clipboard in hand, telling you what you have.

The Numbers Are Worse Than You Think

The Open Science Collaboration published its results in 2015. A team of 270 researchers attempted to replicate 100 studies from three major psychology journals. The findings were devastating in a way that the field has still not fully absorbed. Of the original studies, 97% had reported statistically significant results. Of the replications, 36% did. The average effect size in the replications was roughly half of what the original studies had claimed.

This was not a fringe result about fringe research. These were studies from Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition. The bread and butter of the discipline. The studies that textbooks cite, that review articles reference, that downstream clinical interventions are built on.

The replication crisis, as it came to be called, has a straightforward explanation that is genuinely difficult to accept: the incentive structure of academic psychology systematically produces unreliable findings. Publication bias means journals overwhelmingly publish positive results. Researchers who find nothing don’t get published; researchers who find something do. This creates a selection filter where the published literature is enriched for false positives in the same way a rigged deck is enriched for aces. Not because anyone is cheating, necessarily. Because the game is designed so that only certain cards make it to the table.
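
The filter is easy to simulate. Below is a minimal sketch in Python; the base rate of true hypotheses, the statistical power, and the publish-only-positives rule are all invented for illustration, not estimates from any real literature:

```python
import random

random.seed(0)

N_HYPOTHESES = 10_000   # hypotheses tested across a field (assumed)
P_TRUE = 0.10           # fraction of tested hypotheses that are real (assumed)
ALPHA = 0.05            # false positive rate when the hypothesis is false
POWER = 0.50            # chance a real effect reaches significance (assumed)

published_true, published_false = 0, 0
for _ in range(N_HYPOTHESES):
    is_real = random.random() < P_TRUE
    significant = random.random() < (POWER if is_real else ALPHA)
    if significant:  # journals only see the significant results
        if is_real:
            published_true += 1
        else:
            published_false += 1

total = published_true + published_false
print(f"published findings: {total}")
print(f"false positives among them: {published_false / total:.0%}")
# Under these assumptions, roughly half the published record is noise,
# even though every individual test applied the 5% threshold correctly.
```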

Add to this the practice of p-hacking, which is less exotic than the name suggests. A researcher collects data and runs analyses until something crosses the p < .05 threshold. Exclude a few outliers. Try a different statistical test. Split the data by gender or age and see if one subgroup shows an effect. Each of these decisions is defensible in isolation. Taken together, they transform the statistical machinery from a tool for detecting real effects into a tool for generating publishable noise. Simmons, Nelson, and Simonsohn demonstrated in 2011 that using common “researcher degrees of freedom” on completely random data could produce statistically significant results over 60% of the time. The threshold that was supposed to guarantee a 5% false positive rate was, in practice, guaranteeing nothing. The lock on the door was decorative. And everyone had a key.
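
That inflation can be reproduced in a few lines. The sketch below is not Simmons, Nelson, and Simonsohn's exact design; the sample size and the specific "flexible" choices (a second outcome measure, outlier exclusion, a subgroup split) are stand-ins for the general mechanism:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS = 2000   # simulated "studies", each on pure noise
n = 40          # participants per condition (assumed)
hits = 0

for _ in range(N_SIMS):
    g1 = rng.normal(size=(n, 2))        # condition 1: two outcome measures, all noise
    g2 = rng.normal(size=(n, 2))        # condition 2: same
    sex1 = rng.integers(0, 2, size=n)   # an incidental covariate to split on
    sex2 = rng.integers(0, 2, size=n)

    pvals = []
    for dv in range(2):                          # flexibility 1: pick either outcome
        a, b = g1[:, dv], g2[:, dv]
        pvals.append(stats.ttest_ind(a, b).pvalue)       # full-sample test
        keep_a = np.abs(a - a.mean()) < 2 * a.std()      # flexibility 2: drop "outliers"
        keep_b = np.abs(b - b.mean()) < 2 * b.std()
        pvals.append(stats.ttest_ind(a[keep_a], b[keep_b]).pvalue)
        for s in (0, 1):                                 # flexibility 3: subgroup split
            pvals.append(stats.ttest_ind(a[sex1 == s], b[sex2 == s]).pvalue)

    if min(pvals) < 0.05:   # report whichever analysis "worked"
        hits += 1

print(f"pure-noise datasets yielding p < .05 somewhere: {hits / N_SIMS:.0%}")
# A handful of individually defensible choices pushes the false positive
# rate far above the nominal 5%.
```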

Clinical Psychology Inherited the Wreckage

Here is where this stops being an academic scandal and starts being your problem. Clinical psychology does not exist in a separate universe from research psychology. The studies that don’t replicate are not confined to laboratories where undergraduates sort word lists. They feed directly into the clinical pipeline.

Ego depletion, which informed models of willpower and self-regulation used in cognitive behavioral therapy, failed to replicate in a major preregistered study in 2016. The facial feedback hypothesis, which suggested that smiling makes you happier and was incorporated into therapeutic interventions, failed to replicate. Stereotype threat, which reshaped educational and clinical approaches to test anxiety and performance, has shown dramatically smaller effects in replication attempts than the original studies claimed. Power posing, which Amy Cuddy promoted from a TED Talk into clinical and coaching applications, was built on a study that one of its own co-authors later said did not hold up.

These are not obscure findings. They were woven into the fabric of how therapists understood their clients and how interventions were designed. When the evidence base underneath them collapsed, the clinical practices built on top mostly kept going. Nobody sent a memo.

The lag between research and practice is not unique to psychology; it takes an estimated seventeen years for medical research findings to reach clinical application. But the replication crisis adds a darker layer to this already grim timeline. It is not just that new findings are slow to reach practitioners. It is that the old findings already in use were never as solid as they appeared. The pipeline isn’t just slow. Parts of it are carrying contaminated material.

The situation in psychiatry is arguably worse, because the stakes involve medication. The selective publication of pharmaceutical trial data has been documented so thoroughly that it barely qualifies as controversial anymore. Erick Turner's 2008 analysis of antidepressant trials submitted to the FDA found that 94% of published trials showed positive results, while only 51% of all trials (published and unpublished) were positive. The published literature showed success at nearly twice the true rate, and Turner's analysis found that selective publication inflated the apparent effect sizes by roughly a third. Not because the drugs don't work at all; there is real, if modest, evidence for antidepressant efficacy. But the magnitude of the effect that informed prescribing decisions was substantially inflated by a publication system that buried the failures.
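
The mechanics of that inflation are simple enough to simulate. In the toy model below, the true effect size, the trial size, and the publication rule are all assumptions, chosen only to show the direction of the distortion:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRUE_EFFECT = 0.30   # assumed true standardized effect (Cohen's d)
n = 100              # patients per arm (assumed)
N_TRIALS = 5000

observed, published = [], []
for _ in range(N_TRIALS):
    drug = rng.normal(TRUE_EFFECT, 1.0, n)
    placebo = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((drug.var(ddof=1) + placebo.var(ddof=1)) / 2)
    d = (drug.mean() - placebo.mean()) / pooled_sd
    observed.append(d)
    if stats.ttest_ind(drug, placebo).pvalue < 0.05 and d > 0:
        published.append(d)   # only positive, significant trials get "published"

print(f"true effect:                      {TRUE_EFFECT:.2f}")
print(f"mean effect across all trials:    {np.mean(observed):.2f}")
print(f"mean effect in published trials:  {np.mean(published):.2f}")
print(f"fraction of trials published:     {len(published) / N_TRIALS:.0%}")
# Because only the lucky draws clear the significance bar, the published
# average lands well above the true effect.
```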

The DSM Was Never Built on Replicated Science

The Diagnostic and Statistical Manual, which is the architecture of modern psychiatric diagnosis, was not constructed through the process that the replication crisis exposed as broken. It was constructed through a process that was never scientific to begin with.

DSM categories are created by committee. Panels of experts debate the boundaries of disorders, negotiate the number of symptoms required for a diagnosis, and vote on what counts. The process is closer to legislation than experimentation. Robert Spitzer, who led the creation of DSM-III and transformed the manual from a psychoanalytic document into a symptom-checklist document, was explicit about the fact that the categories were designed for reliability, not validity. The goal was to get clinicians to agree on what to call things, not to ensure that the things they were naming corresponded to distinct biological entities.

This worked, in the sense that diagnostic reliability improved. Two psychiatrists looking at the same patient could now agree on the label more often than before. But reliability is not validity. Two astrologers can reliably agree that someone born on March 15 is a Pisces. The reliability tells you nothing about whether Pisces is a real category that carves nature at its joints.

The field studies for DSM-5 produced reliability numbers that were, by the manual’s own published data, troubling. Major depressive disorder achieved a kappa of 0.28, which falls in the range that research methodologists call “fair” and ordinary people would call “barely better than chance.” Generalized anxiety disorder came in at 0.20. Mixed anxiety-depressive disorder was -0.004, which is literally indistinguishable from random agreement. These are the diagnoses that millions of people carry; that determine what medication they’re prescribed; that shape how they understand their own suffering.
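
For readers unfamiliar with the statistic: kappa measures agreement after subtracting what two raters would achieve by guessing according to their own base rates. A small worked example, with the clinicians and their diagnoses invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick the same label
    chance = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    return (observed - chance) / (1 - chance)

# Two clinicians label the same 10 patients (hypothetical data)
a = ["MDD", "MDD", "GAD", "MDD", "none", "GAD", "MDD", "none", "MDD", "GAD"]
b = ["MDD", "GAD", "GAD", "MDD", "MDD",  "GAD", "none", "none", "MDD", "MDD"]

print(f"raw agreement: {sum(x == y for x, y in zip(a, b)) / len(a):.2f}")
print(f"kappa:         {cohens_kappa(a, b):.2f}")
# Raw agreement looks respectable (0.60); kappa is substantially lower,
# because raters who both diagnose MDD often will agree often by chance alone.
```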

The consequences extend beyond individual prescriptions. The entire apparatus of clinical guidelines, treatment algorithms, and “evidence-based practice” rests on meta-analyses that pool results from published studies. If the published studies are a biased sample of all studies conducted (and they are), then the meta-analyses are biased too, and the guidelines built on those meta-analyses are transmitting inflated confidence from the research literature directly into the protocols that govern patient care. The whole chain, from study to publication to meta-analysis to guideline to prescription pad, is contaminated at the source.
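
One way to see the transmission step: run the same inverse-variance pooling a fixed-effect meta-analysis uses, once on all trials and once on the published subset. Everything in this sketch (effect size, trial count, publication rule) is an assumption, reused from the toy model above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
TRUE_EFFECT, n = 0.30, 100   # same toy assumptions as before

def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted estimate, as in a fixed-effect meta-analysis."""
    w = 1.0 / np.asarray(variances)
    return np.sum(w * np.asarray(effects)) / np.sum(w)

all_d, all_var, pub_d, pub_var = [], [], [], []
for _ in range(200):   # 200 trials feeding the meta-analysis
    drug = rng.normal(TRUE_EFFECT, 1.0, n)
    placebo = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((drug.var(ddof=1) + placebo.var(ddof=1)) / 2)
    d = (drug.mean() - placebo.mean()) / pooled_sd
    var_d = 2 / n + d**2 / (4 * n)   # standard approximation for var(d)
    all_d.append(d); all_var.append(var_d)
    if stats.ttest_ind(drug, placebo).pvalue < 0.05 and d > 0:
        pub_d.append(d); pub_var.append(var_d)

print(f"pooled estimate, all trials:       {fixed_effect_pool(all_d, all_var):.2f}")
print(f"pooled estimate, published trials: {fixed_effect_pool(pub_d, pub_var):.2f}")
# The meta-analysis of the published subset converges, confidently,
# on the wrong number.
```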

Why None of This Has Changed Clinical Practice

The replication crisis broke open in psychology around 2011-2012. It has been over a decade. Preregistration is now common in research. Open data and open materials are becoming norms. The methodological reforms are real and meaningful. But clinical practice has barely budged.

The reason is structural, not conspiratorial. Clinical training programs teach what textbooks say. Textbooks cite the published literature. The published literature is contaminated with inflated effect sizes and unreplicated findings. Updating the textbooks requires someone to systematically identify which findings hold and which don’t, and no one has the incentive or the infrastructure to do this at scale. Meanwhile, clinicians who were trained ten or twenty years ago are practicing based on what they learned, and continuing education requirements don’t typically include “here is the list of things we taught you that turned out to be wrong.”

Insurance reimbursement requires diagnostic codes. Diagnostic codes come from the DSM. The DSM categories were not built on the kind of evidence that the replication crisis says we should demand. But dismantling the DSM would require dismantling the reimbursement system, which would require dismantling the insurance system, which nobody is going to do because a bunch of psychologists discovered that ego depletion doesn’t replicate.

There is also a straightforward human problem. A clinician who has spent twenty years diagnosing depression based on DSM criteria and prescribing SSRIs based on published trial data cannot easily absorb the message that the diagnostic category is unreliable and the evidence for the medication was inflated. This is not because clinicians are stupid or defensive. It is because the alternative is vertiginous. If the foundations are shaky, what exactly are you supposed to do on Monday morning when a patient walks in and says they can’t get out of bed? You can’t tell them to wait for better science. You can’t prescribe epistemological humility. You do what you were trained to do, because that is what you have.

The Honest Position Is Uncomfortable

The replication crisis does not mean that all of psychology is wrong. It does not mean that antidepressants don’t work or that therapy is useless or that mental suffering isn’t real. The people who draw those conclusions are making the same mistake in reverse: replacing one oversimplification with another.

What the replication crisis actually means is something harder to sit with: the confidence level assigned to most psychological and psychiatric knowledge is too high. The findings are less robust, the effect sizes are smaller, the diagnostic categories are fuzzier, and the evidence base for specific interventions is thinner than the published literature suggests. None of this means zero. It means less. And “less” is a genuinely difficult quantity to work with when you’re a clinician who needs to do something, or a patient who needs to understand what’s happening to them.

The methodological reformers in psychology deserve enormous credit for doing the unglamorous work of actually checking whether the field’s claims hold up. But the gap between what the reformers have discovered and what the average practicing clinician knows remains vast. Your therapist probably hasn’t read the Open Science Collaboration’s replication results. Your psychiatrist probably hasn’t read Turner’s analysis of antidepressant publication bias. Not because they’re negligent, but because there is no mechanism in the system to deliver this information to them in a form that changes what they do.

The patient who walks into a psychiatrist’s office in 2026 is walking into a practice built on an evidence base that the field’s own methodologists have spent a decade demonstrating is unreliable. The diagnosis they receive was created by committee vote, not empirical discovery. The medication they’re prescribed was approved based on a published literature that systematically overstates efficacy. None of this means they won’t be helped. Many people are helped. But the help is happening inside a system that is far less certain about what it knows than it presents itself as being.

That gap between the system’s confidence and the evidence’s strength is not a detail. It is the central problem of modern mental health care. And almost nobody on the clinical side is talking about it, because the clinical side was designed to deliver answers, and “we’re less sure than we thought” is not an answer anyone is set up to give.