While participation in Medicare accountable care organizations (ACOs) continues to grow—9 million Medicare beneficiaries are currently attributed to Medicare Shared Savings Program (MSSP) ACOs alone, up 1.3 million since 2016—controversy swirls around their impact. Some analysts argue that ACOs have saved money (albeit a small amount) and that they could realize greater savings as the years go on. Others believe that there have been no savings at all from ACOs and that reform requires a new approach.
Some of this disagreement arises from the commentators’ differing conceptions of what ACOs would have spent had they not been under ACO contracts. Indeed, one of the main difficulties in evaluating the success of the ACO project is choosing an appropriate counterfactual scenario against which to judge ACO performance. While we can observe actual spending for ACO beneficiaries, the counterfactual cannot be observed, so it is very hard to know what would have been spent if ACOs did not exist. It is important to avoid using the Centers for Medicare and Medicaid Services’ (CMS’s) benchmarks to evaluate ACO savings, because benchmarks are constructed with policy goals in mind; they are not designed to reflect the counterfactual scenario in which the providers were not under the ACO contract.
Benchmarks Are Not True Counterfactuals
It is tempting to calculate savings by comparing actual spending to the ACO benchmarks set by CMS, particularly because benchmark data are publicly available and are what CMS uses to calculate shared “savings.” These benchmark-based estimates are also commonly (although not always) what CMS uses when it reports savings, which is often taken to imply that benchmarks are valid counterfactuals. But these “savings” are not intended to support broad conclusions about the impact of ACOs on spending from a policy perspective. In fact, the benchmarks are (and should be) set to accomplish a number of policy objectives: for example, to induce participation, to establish incentives for ACOs to lower spending, and to meet the program’s fiscal goals. If ACOs’ benchmarks were set at the true counterfactual (what ACOs would spend in the absence of the ACO program), inefficient ACOs would have higher benchmarks than other ACOs, a policy concern CMS may want to remedy by having benchmarks converge. Indeed, CMS recently changed the benchmarking rules so that, starting in an ACO’s second contract period, benchmarks are re-set each contract period toward a regional average; as benchmarks in each region converge, they become even poorer counterfactuals.
Even if CMS’s intention were to set benchmarks at ACOs’ expected spending in the absence of efforts to lower spending, the design of the benchmarks in the Medicare ACO programs makes them, by construction, systematically invalid counterfactuals. Specifically, the benchmark for an MSSP ACO entering its first contract period is calculated from historical spending for ACO-assigned beneficiaries in the three years prior to the start of the contract. A weighted, risk-adjusted average of spending during these baseline years is then trended forward using concurrent national average spending growth in Medicare Parts A and B; that is, the same national increase is applied to establish ACO benchmarks in all regions. Because rates of Medicare spending growth vary widely across geographic areas, this yields a faulty estimate of savings for almost every ACO. Judging savings by comparing spending to the benchmark therefore arbitrarily favors ACOs in areas where health care spending grew slower than the national average: ACOs in low-growth areas could “save” simply by maintaining the status-quo trend in the region, because benchmark growth will outpace spending growth there by definition. Similarly, ACOs in high-growth regions will appear to fail even if they saved money relative to their non-ACO neighbors.
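To make the geography problem concrete, here is a stylized numeric sketch. All dollar figures and growth rates below are hypothetical, chosen only to illustrate the mechanism, not drawn from the programs:

```python
# Stylized example: the same national trend is applied to every ACO's
# benchmark, regardless of local spending growth. All numbers hypothetical.

BASELINE = 10_000.0           # per-beneficiary baseline spending ($)
NATIONAL_GROWTH = 0.04        # national Medicare Parts A and B growth
SLOW_REGION_GROWTH = 0.02     # status-quo growth in a slow-growth region
FAST_REGION_GROWTH = 0.06     # status-quo growth in a fast-growth region

benchmark = BASELINE * (1 + NATIONAL_GROWTH)            # same everywhere

# ACOs that merely match their region's status-quo trend (no real savings):
slow_aco_spending = BASELINE * (1 + SLOW_REGION_GROWTH)
fast_aco_spending = BASELINE * (1 + FAST_REGION_GROWTH)

# "Savings" as judged against the benchmark:
slow_vs_benchmark = benchmark - slow_aco_spending   # positive: looks like savings
fast_vs_benchmark = benchmark - fast_aco_spending   # negative: looks like a loss

print(round(slow_vs_benchmark), round(fast_vs_benchmark))  # → 200 -200
```

Neither ACO changed anything relative to its local status quo, yet the benchmark comparison credits one with $200 of “savings” per beneficiary and charges the other with a $200 loss.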
Use Of Benchmarks Likely Underestimates Savings From ACOs
This feature of ACO benchmarks has also caused systematic underestimation of savings across all ACOs because MSSP participation has been disproportionately higher in areas of faster Medicare spending growth. Based on our analysis of Medicare claims data from 2012 through 2014, for every 0.5 percentage point higher rate of spending growth among non-ACO providers in a hospital referral region (HRR), ACO penetration in the HRR by 2014 was 25.0 percentage points higher.
Another under-recognized feature of ACO benchmarks that causes systematic underestimation of savings is that ACOs themselves affect the national spending growth rate used to set benchmarks. If, for example, ACOs lower Medicare reimbursements by 3 percent (as the 2012 entry cohort did by 2014), and if one in three fee-for-service Medicare beneficiaries is in an ACO program (as is currently the case), then national Medicare spending growth would be 1 percent slower because of ACOs, and benchmarks would be 1 percent lower than the correct counterfactual (the expected level of spending in the absence of ACO-related spending reductions).
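The arithmetic in this paragraph can be checked directly. The 3 percent reduction and one-in-three share come from the text; the $10,000 baseline is a hypothetical figure used only to show the effect on a benchmark level:

```python
# Back-of-the-envelope check: a 3% reduction among the one-third of
# fee-for-service beneficiaries in ACOs pulls the national trend used to
# set benchmarks down by about 1%. The $10,000 baseline is hypothetical.

ACO_REDUCTION = 0.03   # reduction in spending for ACO-attributed beneficiaries
ACO_SHARE = 1 / 3      # share of fee-for-service beneficiaries in ACOs

national_drag = ACO_REDUCTION * ACO_SHARE      # ACOs' pull on national spending

true_counterfactual = 10_000.0                         # expected spending absent ACOs
benchmark = true_counterfactual * (1 - national_drag)  # 1% below the counterfactual

print(round(national_drag, 4), round(benchmark))  # → 0.01 9900
```

Every dollar of real ACO savings thus mechanically lowers the yardstick against which those savings are measured, biasing benchmark-based estimates downward.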
Quasi-Experimental Evaluations Are Critical
If we want to know the actual impact of the ACO programs, we must therefore rely on rigorous quasi-experimental evaluations that establish a plausible counterfactual, not one that is guaranteed to be wrong. In evaluations that we and others have conducted, the counterfactual is set at an ACO’s expected level of spending based on its baseline spending and the concurrent spending trend in its service area among a control group of non-ACO providers. Such an approach is not without drawbacks. For example, ACOs serve Medicare patients not attributed to them under their ACO contracts (that is, patients in the control group). If there are strong spillovers (that is, if ACO efforts also reduce spending for control-group patients), the regional trend itself would be depressed, so using it to establish counterfactuals could underestimate savings.
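A minimal sketch of this kind of counterfactual construction, with made-up numbers. This illustrates the general difference-in-differences logic described above, not the authors’ actual estimation model:

```python
# Illustrative counterfactual: the ACO's baseline spending trended forward
# by the concurrent change among local non-ACO (control) providers.
# All numbers are hypothetical.

aco_baseline = 10_000.0      # ACO per-beneficiary spending, baseline period
aco_actual = 10_150.0        # ACO spending in the performance year
control_baseline = 9_800.0   # control-group spending, baseline period
control_actual = 10_094.0    # control-group spending, performance year

control_growth = control_actual / control_baseline - 1   # 3% regional trend
counterfactual = aco_baseline * (1 + control_growth)     # expected ACO spending
estimated_savings = counterfactual - aco_actual          # per-beneficiary savings

print(round(counterfactual), round(estimated_savings))  # → 10300 150
```

Here the ACO’s spending rose in absolute terms, yet it saved $150 per beneficiary relative to what the local non-ACO trend implies it would otherwise have spent, which is exactly the distinction a benchmark comparison misses.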
These analytic choices make a difference when it comes time to judge the ACO program’s level of success and develop policy recommendations. Using the benchmark as the measure of success, analysts have concluded, for example, that CMS actually lost money (on the order of $200 million) in the ACO program and that it was a bad deal for CMS. Contrast this with a conclusion based on the use of a counterfactual designed for research purposes (and based on regional spending trends):
“In 2014, aggregate spending reductions across all 3 cohorts exceeded bonus payments, constituting a net savings of $287 million to Medicare, or $67 per ACO-attributed beneficiary (0.7% of total spending for ACO-attributed beneficiaries)…suggesting that shared-savings contracts without downside risk for excess spending—in which 95% of MSSP ACOs currently participate—may be a fiscally viable alternative payment model for Medicare.”
This is similar to the conclusion reached by CMS analysis using a research-derived counterfactual, as opposed to a program-derived benchmark intended to establish incentives. Another finding that has emerged from evaluations is that the savings are growing with longer program participation, as many expected. Thus, averaging the first-year savings among recent entrants with the third-year savings of earlier entrants yields a lower programwide estimate of savings that may not reflect what we should expect from the program once participation has reached its maximum.
Constructing a valid counterfactual is difficult, but it is necessary, and analysts will continue to disagree about the best approach. However, thoughtful analysis of the effects of any program, including the ACO program, must avoid the temptation to use easily available figures if they are not designed to capture the true program effects. Evidence is critical for evidence-based policy; what we call evidence therefore matters. The bottom line is that we should not pay attention to any analysis that uses the benchmarks as the basis for assessing the effects of ACOs on Medicare spending. The benchmarks set the incentives and must be used by CMS to calculate shared savings bonuses, but they should not be used to evaluate impact.