With alarming regularity, promising pilots in the health care improvement and implementation field have little overall impact when applied more broadly. For example, following early reports that care coordination programs benefit patients and reduce costs, a 2012 Office of the Inspector General (OIG) report found that, on average, 34 care coordination and disease management programs had no net effect on hospital admissions or regular Medicare spending. In 2014, Friedberg and colleagues found that patient-centered medical homes (PCMHs) in Pennsylvania had no impact on utilization or costs of care, and produced negligible improvement in quality, despite early reports promising decreased costs and improved quality of care.

Similarly, surgical checklists have been adopted globally after early findings of improved surgical outcomes. In Ontario in particular, policies have called for universal adoption of checklists. However, Urbach and colleagues recently compared operative mortality and surgical complications at 101 acute care hospitals in Ontario and found that surgical checklist implementation was associated with no overall positive effects.

This phenomenon was described in the 1980s by American program evaluator Peter Rossi as the “Iron Law” of evaluation in studies of the impact of social programs. Rossi argued that as a new model is implemented widely across a broad range of settings, its effect will tend toward zero. As a result, policymakers are unsure whether to encourage model expansion.

In this post we describe how new models can fall foul of Rossi’s Iron Law in several interdependent ways, and we recommend approaches to reduce the likelihood of this happening. Specifically, we argue that the fact that a pilot does not work everywhere does not mean it should be wholly abandoned. Instead, data should be reviewed at the individual site level. Evaluators and policymakers should understand where and why a model has worked and use that information to guide its adaptation and spread.

Cargo-Cult Quality Improvement

One explanation for why new models implemented more widely tend to show limited or negligible impact is that, after initial testing and implementation in a few places, the full complexity of an innovation may not be understood. Too often, a simplified version of the change, frequently in the form of a fixed protocol, is described with no reference to the core ideas that underpin the change model. Subsequent spread of this fixed protocol across diverse settings then results in limited overall impact. Studying the successful Keystone Project in Michigan, which reduced rates of central line-associated bloodstream infection (CLABSI), Dixon-Woods described the problem of an over-simplistic change model that fails to describe the ‘active ingredients’ to users. She coined the term “cargo-cult quality improvement” in reference to a 1974 commencement address by Richard Feynman at Caltech, who said:

“In the South Seas there is a Cargo Cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they’ve arranged to make things like runways, to put fires along the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas—he’s the controller—and they wait for airplanes to land. They’re doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn’t work. No airplanes land. So I call these things Cargo Cult Science, because they follow all the apparent precepts and forms of scientific investigation, but they’re missing something essential, because the planes don’t land.”

One example of cargo-cult quality improvement is described in a commentary by Lucian Leape on the Canadian implementation of the surgical checklist. Leape suggests that the full checklist approach may not have been applied, because educational components were omitted. He reminds us that the act of checking off a box does not make a checklist effective. Rather, effective checklist use may be due to the underlying core concepts of improved teamwork and communication, facilitated by checklist completion. Avoiding cargo-cult quality improvement requires that the activating core concepts of new models are well specified before policy mechanisms are used to replicate them widely.

Building Fordlandia

Faced with rising costs of rubber supplied from Asia, Henry Ford wanted to secure a cheaper supply for his car manufacturing business. During the 1920s, Ford built a model town, Fordlandia, in the Amazon rainforest in Brazil. The town was based on American architecture and town planning styles, and was governed according to Ford’s particular American values. Fordlandia employed indigenous people on the rubber tree plantations. However, the local environment did not suit the growth of rubber trees, which were susceptible to pests. Moreover, the indigenous population was uncomfortable with the imposed structure and culture, which led to revolts and strikes. These factors led to the failure of Fordlandia, which was sold at a loss by Ford in 1945.

Fordlandia illustrates the danger of assuming that a seemingly well understood model will work everywhere, or that the effect of a model is independent of the local context, culture, and wider macro-environment. If a model is implemented in settings or contexts that differ more and more from those in which the first positive effects were found, it is likely that the model will not be appropriate to the new settings and that smaller and smaller effect sizes will be found.

For example, although the 2012 OIG report cited above found no average positive effects from care coordination programs on hospital admissions, four programs reduced admissions by 15 percent or more. In the checklist study, while no hospitals had significant changes in mortality, six showed significantly lower complication rates, and three showed higher complication rates. Assuming a “one-size-fits-all” approach and ignoring the variation across sites in culture, readiness, and macro-environmental effects creates the risk that policymakers will question the value of the entire concept and cease investing in it.

Ignoring The “How”

Another reason new models fail to thrive when scaled up is persistent confusion between the effectiveness of a new model and that of the implementation approach. Even if we understand the core concepts of a model and where it is likely to work, we still need to understand how to move from the current state to the new state. Consequently, a spread or scale-up approach that merely distributes a description of the model, without details or examples of how the changes should be implemented, is unlikely to lead to effective adoption. Incomplete or incorrect adoption confounds the results, suggesting that the intervention has no impact when in fact it was poorly applied.

For example, as described above, Dixon-Woods’ study of the CLABSI reduction work in Michigan demonstrated the importance of knowing the core concepts of a model. Dixon-Woods also described the sociological aspects of the work that were crucial in implementing the clinical interventions. The “Matching Michigan” initiative, launched by the UK’s National Health Service to replicate Michigan’s CLABSI reductions, used a set of clinical interventions similar to those that had worked in the US. However, the UK’s implementation approach did not match that described by Dixon-Woods. For example, the UK approach did not employ standardized data collection, which may have contributed to the limited effectiveness of the intervention.

The final way to fall foul of Rossi’s Iron Law is to turn a model and implementation strategy into a fixed protocol, insisting that any effort to spread or scale up the model follow that rigid protocol. Such dogmatic approaches are often accompanied by a single summative evaluation, frequently in the form of a randomized controlled trial, with the results revealed only at the end of the implementation phase. As we have discussed above, many models may work only in certain contexts, so a single pooled result can hide where a model succeeded.

For example, the MERIT randomized controlled trial found that the use of rapid response teams to treat deteriorating acute care patients was not effective. However, MERIT used a fixed-protocol approach, in which all 12 intervention sites were expected to employ the same model in response to a similar signal from a patient. Related studies demonstrate substantial variation in how rapid response teams are put into practice. Findings generated from a fixed-protocol implementation, imposed by an orthodox evaluation design that excludes sites that amend the model, will likely have limited real-world application.

Reducing The Effect Of The Iron Law

There are ways to reduce the impact of the four primary causes described above. Start by specifying the core concepts of a new model. Then, if a model is appropriate for spread or scale-up, take a more pragmatic approach: encourage settings to start from the core concepts, and allow for local iterative adaptation of the model. The guiding evaluation question should then be, “In what contexts does a new model work, or can it be amended to work, and with what impact?” With that question in mind, here are specific actions that stakeholders can take during a model’s early testing, implementation, and expansion phases.

Early testing and implementation

To move beyond cargo-cult quality improvement, policymakers must develop a good understanding of the detailed tasks undertaken as part of a new or innovative model, along with the theory that underpins them. The concepts, rather than the detailed interventions, are more likely to be generalizable or portable to other settings. For example, “planning post-discharge care transitions” may be an important concept for reducing readmissions, but its application may need to be tailored in a variety of ways to a local context or to certain patient populations. Asserting that everyone, everywhere should have a follow-up primary care appointment within seven days is neither helpful nor feasible.

Policymakers should encourage a pragmatic approach to model expansion that starts with a focus on the concepts and theories underpinning the innovative model. The aim should be to provide methods and tools for people to assess whether the model can be adapted to their setting, including the detailed tasks and pre-conditions required to do this locally. In other words, expect that models will need to be adapted as we learn more about where they work and where they do not. For example, Kirsh and colleagues describe the need to tailor a diabetes intervention to a local setting, for instance by reconfiguring primary care clinics and related services at a local level.

Recognizing these issues gives policymakers an opportunity to more effectively guide investment in health reform initiatives. This is especially relevant today as many promising and innovative new models of care delivery and payment reform are being tested through the Center for Medicare and Medicaid Innovation (CMMI). Some models may be implemented and expanded nationally through rule-making by the Secretary of Health and Human Services or further health reform legislation.

Just as we recommend above, the CMMI evaluation framework includes both formative and summative assessment. In other words, it is generating learning on how the new models can be adapted to local settings. At CMMI, this framework is being used prospectively so that lessons learned can be applied to ongoing improvement and enable mid-course adjustments. This approach will also inform the processes used to expand models that show promise beyond the initial testing sites.

There is no simple one-size-fits-all evaluation solution. We must challenge the demonstration and evaluation communities to develop methods that predict where new models will work best and what the cost and quality benefits will be. As a recent commentary by Jha and Pronovost pointed out, the quality improvement movement in health care must continue to evolve from “just knowing” to actively participating in policy debates.

However, the research community also needs to move away from impact assessment through randomized controlled trials as its default design. Rossi’s Iron Law is a stark reminder that orthodox evaluation techniques are not appropriate for dealing with the complexities associated with health system improvement and transformation over time, particularly the large-scale spread and expansion of models tested in smaller, more controlled settings.