A Comparative Effectiveness Research (CER) study shows that surgery is better than medical treatment for a particular cardiac condition. My patient is 78 years old and has complicated diabetes. Does the study apply? Another patient is 48 years old and otherwise healthy. Does it apply here?

Can the overall results of a CER study be applied to all patients in the target population? Are there substantial, undetected variations among patients in the results of CER? What is the extent of exceptions? These are important policy questions in applying results of CER to day-to-day decisions, clinical guidelines, performance measures and other facets of the modern healthcare system.

The “gold standard” approach to CER is the randomized controlled trial (RCT), a scientific comparison of two or more clinical strategies. Its downsides are that it is generally conducted in a special environment and usually has a rather narrow (and possibly unrepresentative) population spectrum. Two variants, the Practical (or Pragmatic) Clinical Trial (PCT) and the Large Simple Trial (LST), include a wider spectrum of patients and more diverse clinical settings.

These approaches provide “average” results, and for the most part it is thought that averages do apply to a large segment of the population for which they are intended. However, there are clearly differences in effect (heterogeneities of treatment effect, or HTEs) that manifest among CER study subjects and presumably to a greater extent in the intended population outside the study. Two approaches may be equivalent on average but one may be better in a particular group, and differences may be less apparent when the study’s population base is narrow. A long list of factors contributes to these HTEs for CER and other trials: comorbidities, severity of illness, genetics, age, medication adherence, susceptibility to adverse events, ethnicity, site, economics and others.

Important metrics for HTEs are variation in risk and the benefit/risk ratio. Variations in baseline risk can be substantial among risk quartiles within studies, e.g. over 10-fold in a group of heart attack studies and 70-fold in studies of kidney disease progression.

There is a balance between risk and benefits. In general, the main (at times the only) benefit of an intervention is for those in the highest risk quartile. At the lowest risk quartile or in those with diminished benefit (such as later rather than earlier treatment of stroke with clot-dissolving tPA therapy), the benefit/risk ratio may be modest, absent or possibly, if there are significant adverse effects, unfavorable. In reviews of trials, primary angioplasty for heart attack compared to medical tPA benefited the 26 percent at highest risk and was estimated to have a likely benefit threshold of about a 2.0-2.6 percent 30-day mortality rate. Between the highest and lowest risk quartiles, there is a large middle risk zone, influenced by many elements.
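The relationship between baseline risk and net benefit described above can be sketched numerically. The following is a minimal illustration, not a clinical model: it assumes that the treatment produces a constant relative risk reduction across risk levels and a fixed absolute harm, and every number in it (the 20 percent relative reduction, the 1 percent harm, the quartile risks) is hypothetical and not taken from any trial.

```python
# Illustrative sketch only: all numbers below are hypothetical assumptions.

def net_absolute_benefit(baseline_risk, relative_risk_reduction, treatment_harm):
    """Absolute risk reduction minus the absolute risk of treatment-caused harm."""
    return baseline_risk * relative_risk_reduction - treatment_harm


def benefit_threshold(relative_risk_reduction, treatment_harm):
    """Baseline risk at which benefit and harm balance (net benefit is zero)."""
    return treatment_harm / relative_risk_reduction


RRR = 0.20    # assumed constant relative risk reduction
HARM = 0.01   # assumed fixed absolute risk of serious harm from treatment

# Hypothetical baseline event risks by risk quartile (a 12-fold spread).
QUARTILE_RISKS = {
    "Q1 (lowest)": 0.01,
    "Q2": 0.03,
    "Q3": 0.06,
    "Q4 (highest)": 0.12,
}

if __name__ == "__main__":
    print(f"Break-even baseline risk: {benefit_threshold(RRR, HARM):.3f}")
    for label, risk in QUARTILE_RISKS.items():
        net = net_absolute_benefit(risk, RRR, HARM)
        verdict = "net benefit" if net > 0 else "net harm"
        print(f"{label:13s} baseline {risk:.2f} -> net {net:+.3f} ({verdict})")
```

Under these assumptions the same relative effect yields net harm in the two lowest quartiles and net benefit in the two highest, which is why an average trial result can conceal opposite outcomes at the extremes of risk.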

On the other hand, benefits are unlikely at extreme risk or if the “pay off time” of a beneficial strategy is likely to be longer than the expected harm. Neither colorectal screening nor coronary prevention benefits persons with short life expectancy. Also, risk may be greater than benefit in other situations. In diabetic patients with complications, intensive glucose lowering therapy increased mortality and hypoglycemia but was beneficial in others. Another qualification is that the numbers at the high risk levels are small and most events in fact occur in modest and low risk patients (the “prevention paradox,” as it has been called). If the only focus is on the highest risk individuals, many or most targets for treatment will be missed.
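The “prevention paradox” is a matter of arithmetic, and a small worked example may make it concrete. The sketch below uses an entirely hypothetical population of 10,000 people divided into three risk strata; the stratum sizes and event risks are assumptions chosen for illustration only.

```python
# Illustrative sketch only: stratum sizes and risks are hypothetical.

# (stratum name, number of people, event risk over the period of interest)
STRATA = [
    ("low", 7000, 0.01),
    ("moderate", 2500, 0.04),
    ("high", 500, 0.20),
]


def expected_events(strata):
    """Expected number of events in each stratum (people x risk)."""
    return {name: n * risk for name, n, risk in strata}


def share_of_events(strata):
    """Fraction of all expected events that arises in each stratum."""
    events = expected_events(strata)
    total = sum(events.values())
    return {name: count / total for name, count in events.items()}


if __name__ == "__main__":
    events = expected_events(STRATA)
    shares = share_of_events(STRATA)
    for name, n, risk in STRATA:
        print(f"{name:9s} n={n:5d} risk={risk:.2f} "
              f"events={events[name]:6.1f} share={shares[name]:.0%}")
```

In this hypothetical population the high-risk stratum has by far the highest individual risk, yet it contributes only a minority of the expected events, because the low and moderate strata contain many more people.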

Site differences also occur in trials as they do in application of care. While some statistical variation is inevitable and there may be protocol variations in trials, causes also include administrative effectiveness, e.g. in creating collaborative efforts and technical support; skill and experience of providers and teams; adequate hospital capacity for intervention; geographic availability; communication problems; and economics.

Techniques of Evaluation

Subgroup analysis, which is often based on risk, is a common approach but is also controversial, has many pitfalls and has at times been mocked: one famous report recounted subgroup differences according to signs of the Zodiac. Experts have also expressed the opinion that the “average result of a RCT is usually a more reliable estimate of treatment effect in the various subgroups examined than are the observed effects in individual subgroups.”

A number of subgroup analyses that were first accepted have been shown to be incorrect, e.g. selective effects of the drug amlodipine in ischemic cardiomyopathy and of tamoxifen in women with breast cancer below age 50 years. Others are considered valid and used clinically, e.g. in coronary artery bypass surgery and use of aspirin versus anticoagulants to prevent stroke in atrial fibrillation.

Subgroup differences may be quantitative, i.e. a difference in magnitude from average effect (which may or may not be a reason to treat differently) or, more seriously and less commonly, qualitative (adverse when the study shows average benefit). For subgroup analysis to be valid, it should be part of initial planning (a priori) rather than performed after the trial is completed (a posteriori), and it should have sufficient power (numbers), a modest number of subgroups analyzed, other statistical particulars, and some biologic or experimental grounding. It also should not be a “fishing expedition”. Analyses with less statistical rigor can generate hypotheses for further studies or analyses, and it is always possible that an undetected group has benefits in an overall negative study.
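One reason for limiting the number of subgroups can be shown with a short calculation. The sketch below assumes that the subgroup tests are independent, that there is no true subgroup effect, and that each test uses the conventional 0.05 significance level. These are simplifying assumptions; real subgroup tests are often correlated. The Bonferroni correction shown is one common, conservative adjustment, used here only as an example.

```python
# Illustrative sketch under simplifying assumptions: independent tests,
# no true subgroup effect, conventional per-test significance level.

def familywise_error_rate(n_subgroups, alpha=0.05):
    """Chance of at least one spurious 'significant' subgroup finding."""
    return 1.0 - (1.0 - alpha) ** n_subgroups


def bonferroni_alpha(n_subgroups, alpha=0.05):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_subgroups


if __name__ == "__main__":
    for k in (1, 5, 12, 20):
        print(f"{k:2d} subgroups: chance of a false positive "
              f"{familywise_error_rate(k):.2f}, "
              f"Bonferroni per-test level {bonferroni_alpha(k):.4f}")
```

With 12 subgroups, as in an analysis by signs of the Zodiac, the chance of at least one false-positive finding under these assumptions is roughly 46 percent, which is why prespecification and a modest number of subgroups matter.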

Genomic study will be a definitive vehicle for individualization of trial results with burgeoning data in many areas, e.g. individualized cancer treatments, other pharmacogenomics such as variation in drug effect or handling, and determination of disease severity. Thus far, CER trials have provided only modest amounts of genomic data though in the future, hopefully, genetic collections will increase in CER, especially with the large genomic databases under construction.

For the most part, clinicians and guidelines apply CER results to the individual patient amidst the complex of other inputs that include: additional trial data, physician evaluation of overall clinical status and risk, genomic data, other biomarkers, site considerations, patient preferences and values and at times the results of subgroup analysis. These formulations constitute what has long been called clinical judgment.

Obtaining Individualized and Group Data

In its new generation, how can CER provide more individualized or group data to inform precise decisions and clinical judgment? Accomplishing this goal requires development of comprehensive strategies utilizing research and non-research data — wide population spectrum trials and registries — rather than relying on any single study.

For initial trial approaches for CER, PCTs or LSTs are well suited, as they enroll large and varied patient populations and healthcare sites. These types of trials are more likely to encompass HTEs than the more narrowly based conventional RCTs. Risk stratification, rigorous subgroup analysis a priori, and similar methods can be used to determine HTEs in these trials.

Observational data and methods can then extend the scope of data of these studies, fill in gaps regarding individuals, groups, and sites, and provide real world evidence. Such data derive from large, clinically rich databases and possibly claims databases, local registries, quality improvement data and other sources of clinical information. Data can then be analyzed using sophisticated observational research methods, such as instrumental variables in natural experiments and others as they develop.

These data modalities broaden the population base and provide subgroup, individual, site, and operator data on outcomes, adverse events, medication adherence, realistic resource use, and other relevant items. In addition, they extend the time of observation, offsetting the relatively short time frame of most CER trials; this is especially helpful for detecting late adverse events, necessity for repeat procedures, etc. Registries can also match the alignment of practice with CER-derived approaches and help improve local treatment and systems differences detected in CER. For example, certain regimens derived from RCTs, such as intensive glucose control, can be hard to implement. Information derived from large trials and databases can also lead to smaller, tailored trials with selected entry criteria and individualized random assignment, and can inform future subgroup analysis. In this overall construct, the follow-up analyses using databases confirm the realistic use of regimens in day-to-day practice.

One analysis along these lines, made to refine pharmacoepidemiologic methods, matched populations in a Medicare and pharmaceutical database with RCT results. It compared one-year coronary disease mortality in 5 categories of elderly patients (age ≥ 65 years) receiving statin therapy to that in 3 RCTs. As one moved from the lower categories, representing broad populations, to Category 5, the narrow spectrum of the RCTs, mortality converged on that of the RCTs. Patient-matching analyses could also be applied to CER results in a variety of ways to define more precise therapeutic strategies.

Broad collaborations are necessary and underway to plan these strategies and form the databases. They include medical centers, academia, health plans, government, specialty societies, etc. Considerable data are available now but it will take some time and resources for a more complete infrastructure to develop. Hopefully its development will occur in conjunction with databases formed and resourced for other uses, such as quality improvement, as part of an overall learning and research strategy encompassed by the Learning Healthcare System.

CER In Guidelines

Until more complete data on risk and HTE are obtained, how do we now use CER in guidelines? Some individual decisions are clear, as for patients who have extensive comorbidities and short life expectancy. We should also remember that even when the quantitative magnitude of effect varies, trial results still generally apply, and we have to be careful about being overly restrictive.

To whom does the evidence apply and not apply? Guidelines already contain cautions regarding site and discussion of individualization of diagnostic tests and treatments related to risk and other factors. The United States Preventive Services Task Force includes, and has criteria for, subgroup analysis in its guidelines.

As much as possible, the strength and boundaries of evidence regarding individuals, groups, and patient characteristics to which a particular study applies, and reasonable expectations as to what can be derived from available data, should be noted. When possible, guidelines should be informed by quantitative risk analysis and include information on methods of evaluating risk and other study particulars. Subgroup analysis can be used provided it adheres very strictly to rigorous statistical criteria (sufficiently powered, etc.). If not, definitive recommendations based on subgroup analysis should have other confirmatory evidence such as selective trials or database analysis. The importance of patient preferences is and should be acknowledged, especially when the choice of alternatives is complex such as in prostate or breast cancer.

When describing studies in guidelines, as much precision as possible is important, especially regarding population entered. In most guidelines, the research rating (hierarchy or level) of evidence as to whether it is strong or weak is based on the quality of the overall study and not the quality of data referable to risk groups or HTEs. These types of ratings should also be applied to individual and group data. If performance measures are derived from guidelines, they should only be applied to groups of patients in whom studies indicate certain benefit and not to others.

A major change will occur if and when guidelines convert to decision support in the EHR environment and as more interactive websites develop. Then, sophisticated subgroup entries, with backup data and local information, as well as decision-analysis models, can be made available for both providers and patients at the point of care.