

4 Decision-Making in Clinical Medicine

Daniel B. Mark, John B. Wong

Practicing medicine at its core requires making decisions. What makes medical practice so difficult is not only the specialized technical knowledge required but also the intrinsic uncertainty that surrounds each decision. Mastering the technical aspects of medicine alone, unfortunately, does not ensure a mastery of the practice of medicine.

Sir William Osler’s familiar quote “Medicine is a science of uncertainty and an art of probability” captures well this complex duality. Although the science of medicine is often taught as if the mechanisms of the human body operate with Newtonian predictability, every aspect of medical practice is infused with an element of irreducible uncertainty that the clinician ignores at their peril. Although deeply rooted in science, more than 100 years after the practice of medicine took its modern form, it remains at its core a craft, to which individual doctors bring varying levels of skill, knowledge, and understanding. With the exponential growth in medical literature and other technical information and an ever-increasing number of testing and treatment options, twenty-first century physicians who seek excellence in their craft must master a more diverse and complex set of skills than any of the generations that preceded them. This chapter introduces three of the pillars upon which the craft of modern medicine rests: (1) expertise in clinical reasoning (what it is and how it can be developed); (2) rational diagnostic test use and interpretation; and (3) integration of the best available research evidence with clinical judgment in the care of individual patients (evidence-based medicine [EBM]).

■ ■BRIEF INTRODUCTION TO CLINICAL REASONING

Clinical Expertise  Defining “clinical expertise” remains surprisingly difficult. Chess has an objective ranking system based on skill and performance criteria. Athletics, similarly, have ranking systems to distinguish novices from Olympians. But in medicine, after physicians complete training and pass the boards (or get recertified), no tests or benchmarks are used to identify those who have attained the highest levels of clinical performance. At each institution, there are often a few “elite” clinicians who are known for their “special problem-solving prowess” when particularly difficult or obscure cases have baffled everyone else. Yet despite their skill, even such master clinicians typically cannot explain their exact processes and methods, thereby limiting the acquisition and dissemination of the expertise used to achieve their impressive results. Furthermore, clinical virtuosity appears not to be generalizable, e.g., an expert on hypertrophic cardiomyopathy may be no better (and possibly worse) than a first-year medical resident at diagnosing and managing a patient with neutropenia, fever, and hypotension.
Broadly construed, clinical expertise encompasses not only cognitive dimensions involving the integration of disease knowledge with verbal and visual cues and test interpretation but also potentially the complex fine-motor skills necessary for invasive procedures and tests. In addition, “the complete package” of expertise in medicine requires effective communication and care coordination with patients and members of the medical team. Research on medical expertise remains sparse overall and mostly centered on diagnostic reasoning, so this chapter focuses primarily on the cognitive elements of clinical reasoning.

Objective study of the clinical reasoning process is difficult as it occurs in the heads of clinicians. One research approach asks clinicians to “think out loud” as they receive increments of clinical information in a manner meant to simulate a clinical encounter. Another research approach focuses on how doctors should reason diagnostically, to identify remediable “errors,” rather than on how they actually do reason. Much of what is known about clinical reasoning comes from empirical studies of nonmedical problem-solving behavior. Because of the diverse perspectives contributing to this area, with important contributions from cognitive psychology, medical education, behavioral economics, sociology, informatics, and decision sciences, no single integrated model of clinical reasoning exists, and not infrequently, different terms and reasoning models describe similar phenomena.

Intuitive Versus Analytic Reasoning  A useful contemporary model of reasoning, the dual-process theory distinguishes two general conceptual modes of thinking as fast or slow. Intuition (System 1)

provides rapid effortless judgments from memorized associations using pattern recognition and other simplifying “rules of thumb” (i.e., heuristics). For example, a very simple pattern that could be useful in certain situations is “black woman plus hilar adenopathy equals sarcoid.” Because no effort is involved in recalling the pattern, the clinician is often unable to say how those judgments were formulated. In contrast, analysis (System 2), the other form of reasoning in the dual-process model, is slow, methodical, deliberative, and effortful. A student might read about causes of hilar adenopathy and from that list (e.g., Chap. 70), identify diseases more common in black women or examine the patient for skin or eye findings that occur with sarcoid. These dual processes, of course, represent two exemplars taken from the cognitive continuum. They provide helpful descriptive insights but very little guidance in how to develop expertise in clinical reasoning. How these idealized systems interact in different decision problems,

how experts use them differently from novices, and when their usage can lead to errors in judgment remain the subject of study and considerable debate.

Pattern recognition, an important part of System 1 reasoning, is a complex cognitive process that appears largely effortless. One can recognize people’s faces, the breed of a dog, an automobile model, or a piece of music from just a few notes within milliseconds without necessarily being able to articulate the specific features that prompted the recognition. Analogously, experienced clinicians often recognize familiar diagnostic patterns very quickly. The key here is having a large library of stored patterns that can be rapidly accessed. In the absence of an extensive stored repertoire of diagnostic patterns, students (as well as experienced clinicians operating outside their area of expertise and familiarity) often must use the more laborious System 2 analytic approach along with more intensive and comprehensive data collection to reach the diagnosis.

The following brief patient scenarios illustrate three distinct patterns associated with hemoptysis that experienced clinicians recognize without effort:

• A 46-year-old man presents to his internist with a chief complaint of hemoptysis. An otherwise healthy nonsmoker, he is recovering from an apparent viral bronchitis. This presentation pattern suggests that the small amount of blood-streaked sputum is due to acute bronchitis, so that a chest x-ray provides sufficient reassurance that a more serious disorder is absent.

• In the second scenario, a 46-year-old patient who has the same chief complaint but with a 100-pack-year smoking history, a productive morning cough with blood-streaked sputum, and weight loss fits the pattern of carcinoma of the lung. Consequently, along with the chest x-ray, the clinician obtains a sputum cytology examination and refers this patient for a chest computed tomography (CT) scan.
• In the third scenario, the clinician hears a soft diastolic rumbling murmur at the apex on cardiac auscultation in a 46-year-old patient with hemoptysis who immigrated from a developing country and orders an echocardiogram as well, because of possible pulmonary hypertension from suspected rheumatic mitral stenosis.

Pattern recognition by itself is not, however, sufficient for secure diagnosis. Without deliberative systematic reflection, undisciplined pattern recognition can result in premature closure: mistakenly jumping to the conclusion that one has the correct diagnosis before all the relevant data are in. A critical second step, therefore, even when the diagnosis seems obvious, is diagnostic verification: considering whether the diagnosis adequately accounts for all of the presenting symptoms and signs and can explain all the ancillary findings. The following case based on a real clinical encounter provides an example of premature closure.

A 45-year-old man presents with a 3-week history of a “flulike” upper respiratory infection (URI) including dyspnea and a productive cough. The emergency department (ED) clinician pulled out a “URI assessment form,” which defines and standardizes the information gathered. After quickly acquiring the requisite structured examination components and noting in particular the absence of fever and a clear chest examination, the physician prescribed a cough suppressant for acute bronchitis and reassured the patient that his illness was not serious. Following a sleepless night at home with significant dyspnea, the patient developed nausea and vomiting and collapsed. He was brought back to the ED in cardiac arrest and was unable to be resuscitated. His autopsy showed a posterior wall myocardial infarction (MI) and a fresh thrombus in an atherosclerotic right coronary artery. What went wrong?
Presumably, the ED clinician felt that the patient was basically healthy (one can be misled by the way the patient appears on examination—a patient that does not appear “sick” may be incorrectly assumed to have an innocuous illness). So, in this case, the physician, upon hearing the overview of the patient from the triage nurse, elected to use the URI assessment protocol even before starting the history, closing consideration of the broader range of possibilities and associated tests required to confirm or refute these possibilities. Specifically, by concentrating on the abbreviated and focused URI protocol, the clinician failed to elicit the full dyspnea history, which would have revealed that the dyspnea was precipitated by exertion, accompanied by chest heaviness, and relieved by rest, suggesting a far more serious disorder.

Heuristics or rules of thumb are a part of the intuitive system. These cognitive shortcuts provide a quick and easy path to reaching conclusions and making choices, but when used improperly, they can lead to errors. Two major research programs have studied heuristics in a mostly nonmedical context and have reached different conclusions about the value of these cognitive tools. The “heuristics and biases” program focuses on how these mental shortcuts can lead to incorrect judgments. So far, however, little evidence exists that educating physicians and other decision makers to watch for the >100 cognitive biases identified to date has had any effect on the rate of diagnostic errors. In contrast, the “fast and frugal heuristics” research program explores how and when relying on simple heuristics can produce good decisions. Although many heuristics have relevance to clinical reasoning, only four will be mentioned here.

When diagnosing patients, clinicians usually develop diagnostic hypotheses based on the similarity of that patient’s symptoms, signs, and other data to their mental representations (memorized patterns) of the disease possibilities. In other words, clinicians pattern match to identify the diagnoses that share the most similar findings to the patient at hand. This cognitive shortcut is called the representativeness heuristic. Consider a patient with hypertension who has headache, palpitations, and diaphoresis. Given this classic presenting symptom triad suggesting pheochromocytoma, clinicians might judge pheochromocytoma to be quite likely based on the representativeness heuristic. Doing so, however, would be incorrect given that other causes of hypertension are much more common than pheochromocytoma and this triad of symptoms can occur in patients who do not have it.
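The base-rate caveat behind this example can be made concrete with a short Bayes' rule calculation. All numbers below are hypothetical, chosen only to illustrate the arithmetic, not actual estimates for pheochromocytoma:

```python
# Illustrative base-rate calculation (all values assumed for illustration).
# Even a "classic" presentation need not make a rare disease likely.

prior = 0.002      # assumed prevalence of the rare disease among hypertensives
sens = 0.90        # assumed P(classic triad | disease)
false_pos = 0.08   # assumed P(classic triad | no disease)

# Bayes' rule: P(disease | triad)
posterior = (sens * prior) / (sens * prior + false_pos * (1 - prior))
print(f"Posterior probability given the triad: {posterior:.3f}")  # ~0.022
```

Despite the highly "representative" triad, the posterior probability stays near 2% under these assumptions, because the low prior dominates.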
Thus, clinicians using the representativeness heuristic may overestimate the likelihood of a particular disease based on the presence of representative symptoms and signs, failing to account for its low underlying prevalence (i.e., the prior, or pretest, probabilities). Conversely, atypical presentations of common diseases may lead to underestimating the likelihood of a particular disease. Thus, inexperience with a specific disease and with the breadth of its presentations may also lead to diagnostic delays or errors, e.g., diseases that affect multiple organ systems, such as sarcoid or tuberculosis, may be particularly challenging to diagnose because of the many different patterns they may manifest.

A second commonly used cognitive shortcut, the availability heuristic, involves judgments based on how easily prior similar cases or outcomes can be brought to mind. For example, a clinician may recall a case from a morbidity and mortality conference in which an elderly patient presented with painless dyspnea of acute onset and was evaluated for a pulmonary cause but was eventually found to have acute MI, with the diagnostic delay likely contributing to the development of ischemic cardiomyopathy. If the case was associated with a malpractice accusation, such examples may be even more memorable. Errors with the availability heuristic arise from several sources of recall bias. Rare catastrophic outcomes become memorable cases with a clarity and force disproportionate to their likelihood for future diagnosis—for example, a patient with a sore throat eventually found to have leukemia or a young athlete with leg pain subsequently found to have an osteosarcoma—and those publicized in the media or recently experienced are, of course, easier to recall and therefore more influential on clinical judgments.
The third commonly used cognitive shortcut, the anchoring heuristic (also called conservatism or stickiness), involves insufficiently adjusting the initial probability of disease up (or down) following a positive (or negative) test when compared with Bayes’ theorem, i.e., sticking to the initial diagnosis. For example, a clinician may still judge the probability of coronary artery disease (CAD) to be high despite a negative exercise perfusion test and go on to cardiac catheterization (see “Measures of Disease Probability and Bayes’ Rule,” below).

The fourth heuristic states that clinicians should use the simplest explanation possible that will adequately account for the patient’s symptoms and findings (Occam’s razor or, alternatively, the simplicity heuristic). Although this is an attractive and often used principle, it is important to remember that no biologic basis for it exists. Errors from the simplicity heuristic include premature closure leading to the neglect of unexplained significant symptoms or findings.
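The adjustment that Bayes' theorem prescribes, and that the anchoring heuristic falls short of, can be sketched numerically. The pretest probability and test characteristics below are assumed values for illustration only:

```python
# Hypothetical sketch: how a negative test should move the probability of
# CAD under Bayes' rule, versus "anchoring" on the pretest estimate.

pretest = 0.70            # assumed pretest probability of CAD
sens, spec = 0.85, 0.75   # assumed test sensitivity and specificity

lr_negative = (1 - sens) / spec           # negative likelihood ratio
pretest_odds = pretest / (1 - pretest)    # convert probability to odds
posttest_odds = pretest_odds * lr_negative
posttest = posttest_odds / (1 + posttest_odds)
print(f"Pretest {pretest:.0%} -> post-test {posttest:.0%} after a negative result")
```

Under these assumptions the negative result should lower the probability from 70% to roughly 32%; an anchored clinician who still treats the probability as "high" has underadjusted.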

For complex or unfamiliar diagnostic problems, clinicians typically resort to analytic reasoning processes (System 2) and proceed methodically using the hypothetico-deductive model of reasoning. Based on the patient’s stated reasons for seeking medical attention, clinicians develop an initial list of diagnostic possibilities in hypothesis generation. During the history of the present illness, the initial hypotheses evolve in diagnostic refinement as emerging information is tested against the mental models of the diseases being considered, with diagnoses increasing and decreasing in likelihood or even being dropped from or added to consideration as the working hypotheses of the moment. These mental models often generate additional questions that distinguish the diagnostic possibilities from one another. The focused physical examination contributes to further distinguishing the working hypotheses. Is the spleen enlarged? How big is the liver? Is it tender? Are there any palpable masses or nodules? Diagnostic verification involves testing the adequacy (whether the diagnosis accounts for all symptoms and signs) and coherency (whether the signs and symptoms are consistent with the underlying pathophysiologic causal mechanism) of the working diagnosis. For example, if the enlarged and quite tender liver felt on physical examination is due to acute hepatitis (the hypothesis), then certain specific liver function tests will be markedly elevated (the prediction). Should the tests come back normal, the hypothesis may have to be discarded and others reconsidered.

Although often neglected, negative findings are as important as positive ones because they reduce the likelihood of the diagnostic hypotheses under consideration. Chest discomfort that is not provoked or worsened by exertion and not relieved by rest in an active patient lowers the likelihood that chronic ischemic heart disease is the underlying cause.
The absence of a resting tachycardia and thyroid gland enlargement reduces the likelihood of hyperthyroidism in a patient with paroxysmal atrial fibrillation.

The acuity of a patient’s illness may override considerations of prevalence and the other issues described above. “Diagnostic imperatives” recognize the significance of relatively rare but potentially catastrophic conditions if undiagnosed and untreated. For example, clinicians should consider aortic dissection routinely as a possible cause of acute severe chest discomfort. Although the typical presenting symptoms of dissection differ from those of MI, dissection may mimic MI, and because it is far less prevalent and potentially fatal if mistreated, diagnosing dissection remains a challenging diagnostic imperative (Chap. 291). Clinicians caring for patients with acute, severe chest pain should explicitly and routinely inquire about symptoms suggestive of dissection, measure blood pressures in both arms for discrepancies, and examine for pulse deficits. When these are all negative, clinicians may feel sufficiently reassured to discard the aortic dissection hypothesis. If, however, the chest x-ray shows a possible widened mediastinum, the hypothesis should be reinstated and an appropriate imaging test ordered (e.g., thoracic CT angiography or transesophageal echocardiogram). In nonacute situations, the prevalence of potential alternative diagnoses should play a much more prominent role in diagnostic hypothesis generation.

Cognitive scientists studying the thought processes of expert clinicians have observed that clinicians group data into packets, or “chunks,” that are stored in short-term or “working memory” and manipulated to generate diagnostic hypotheses. Because short-term memory is limited (classically humans can accurately repeat a list of 7 ± 2 numbers read to them), the number of diagnoses that can be actively considered in hypothesis-generating activities is similarly limited.
For this reason, the cognitive shortcuts discussed above play a key role in the generation of diagnostic hypotheses, many of which are discarded as rapidly as they are formed, thereby demonstrating that the distinction between analytic and intuitive reasoning is an arbitrary and simplistic, but nonetheless useful, representation of cognition.

Research into the hypothetico-deductive model of reasoning has had difficulty identifying the elements of the reasoning process that distinguish experts from novices. This has led to a shift from examining the problem-solving process of experts to analyzing the organization of their knowledge for pattern matching as exemplars, prototypes, and illness scripts. For example, diagnosis may be based on the resemblance of a new case to patients seen previously (exemplars). As abstract mental models of disease, prototypes incorporate the likelihood of various disease features. Illness scripts include risk factors, pathophysiology, and symptoms and signs. Experts have a much larger store of exemplar and prototype cases, an example of which is the visual long-term memory of experienced radiologists. However, clinicians do not simply rely on literal recall of specific cases but have constructed elaborate conceptual networks of memorized information or models of disease to aid in arriving at their conclusions (illness scripts). That is, expertise involves an enhanced ability to connect symptoms, signs, and risk factors to one another in meaningful ways; relate those findings to possible diagnoses; and identify the additional information necessary to confirm the diagnosis.

No single theory accounts for all the key features of expertise in medical diagnosis. Experts have more knowledge about presenting symptoms of diseases and a larger repertoire of cognitive tools to employ in problem solving than nonexperts. One definition of expertise highlights the ability to make powerful distinctions. In this sense, expertise involves a working knowledge of the diagnostic possibilities and those features that distinguish one disease from another. Memorization alone is insufficient, e.g., photographic memory of a medical textbook would not make one an expert. But having access to detailed case-specific relevant information is critically important. In the past, clinicians primarily acquired clinical knowledge through their patient experiences, but now clinicians have access to a plethora of information sources.
Clinicians of the future will be able to leverage the experiences of large numbers of other clinicians using electronic tools, but, as with the memorized textbook, the data alone will be insufficient to create expertise.

Despite all the research seeking to understand expertise in medicine and other disciplines, it remains uncertain whether any didactic program can actually accelerate the progression from novice to expert or from experienced clinician to master clinician. Deliberate effortful practice (over an extended period of time, sometimes said to be 10 years or 10,000 practice hours) and personal coaching are two strategies often used outside medicine (e.g., music, athletics, chess) to develop expertise. Their use in the context of medical practice has not yet been adequately explored. Some studies in medicine suggest that the most beneficial approach to education exposes students to both the signs and symptoms of specific diseases (disease pattern recognition) and, in addition, the lists of diseases that can present with specific symptoms and signs (differential diagnosis). Active learning opportunities useful for those in training include developing a personal learning system, e.g., systematically reflecting on diagnostic processes used (metacognition) and following up to identify diagnoses and treatments for patients in their care.

■ ■PERSONALIZED DECISION-MAKING

The modern ideal of medical therapeutic decision-making is to “personalize” treatment recommendations. In the abstract, personalizing treatment involves combining the best available evidence about what works with an individual patient’s unique features (e.g., risk factors, genomics, and comorbidities) and their preferences and health goals to craft an optimal treatment recommendation with the patient.
Operationally, two different and complementary levels of personalization are possible: individualizing the risk of harm and benefit for the options being considered based on the specific patient characteristics (precision medicine) and personalizing the therapeutic decision process by incorporating the patient’s preferences and values for the possible health outcomes. This latter process is sometimes referred to as shared decision-making and typically involves clinicians sharing their knowledge about the options and the associated consequences and trade-offs and patients sharing their health goals (e.g., avoiding a short-term risk of dying from coronary artery bypass grafting to see their grandchild get married in a few months).

Individualizing the evidence about therapy does not mean relying on physician impressions of benefit and harm from their personal experience. Because of nonrandom selection, small sample sizes, and rare events, the chance of drawing erroneous causal inferences from one’s own clinical experience is very high. For most chronic diseases, the treatment response is a counterfactual concept, only demonstrable statistically in large patient populations. Because of this, it would be incorrect to infer with any certainty, for example, that treating a hypertensive patient with angiotensin-converting enzyme (ACE) inhibitors necessarily prevented a stroke from occurring during treatment, or that an untreated patient would definitely have avoided their stroke had they been treated. For many chronic diseases, a majority of patients will remain event free over long periods of time regardless of treatment choices; some will have events regardless of which treatment is selected; and those who avoided having an event through treatment cannot be individually identified. Blood pressure lowering, a readily observable surrogate endpoint, does not have a tightly coupled relationship with strokes prevented. Consequently, in most situations, demonstrating therapeutic effectiveness cannot rely simply on observing the outcome of an individual patient but should instead be based on large groups of patients carefully studied and properly analyzed.

Therapeutic decision-making, therefore, should be based on the best available evidence from clinical trials and well-done outcome studies. Trustworthy clinical practice guidelines that synthesize such evidence offer normative guidance for many testing and treatment decisions. However, all guidelines recognize that “one size fits all” recommendations may not apply to individual patients.
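One way group-level evidence is individualized quantitatively: the same relative effect from a trial, applied to patients at different baseline risks, yields very different absolute benefits. The sketch below uses hypothetical numbers (an assumed relative risk of 0.75 and three assumed baseline risks):

```python
# Hypothetical sketch: applying a trial's group-level relative risk to
# individualized baseline risks (e.g., from a validated risk score).

relative_risk = 0.75  # assumed risk of the event with treatment vs. control

for baseline_risk in (0.02, 0.10, 0.30):    # assumed individualized risks
    treated_risk = baseline_risk * relative_risk
    arr = baseline_risk - treated_risk      # absolute risk reduction
    nnt = 1 / arr                           # number needed to treat
    print(f"baseline {baseline_risk:.0%}: ARR {arr:.1%}, NNT {nnt:.0f}")
```

The same 25% relative reduction translates into an absolute benefit 15 times larger for the highest-risk patient than for the lowest-risk one, which is why absolute baseline risk matters for individual recommendations.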
Increased research into the heterogeneity of treatment effects seeks to understand how best to adjust group-level clinical evidence of treatment harms and benefits to account for the absolute level of risks faced by subgroups and even by individual patients, using, for example, validated clinical risk scores.

■ ■NONCLINICAL INFLUENCES ON CLINICAL DECISION-MAKING

More than three decades of research on variations in clinician practice patterns has identified important nonclinical forces that shape clinical decisions. These factors can be grouped conceptually into three overlapping categories: (1) factors related to an individual physician’s practice, (2) factors related to practice setting, and (3) factors related to payment systems.

Practice Style  To ensure that necessary care is provided at a high level of quality, physicians fulfill a key role in medical care by serving as the patient’s advocate. Factors that influence performance in this role include the physician’s knowledge, training, and experience. Clearly, physicians cannot practice EBM if they are unfamiliar with the evidence. As would be expected, specialists generally know the evidence in their field better than do generalists. Beyond published evidence and practice guidelines, a major set of influences on physician practice can be subsumed under the general concept of “practice style.” The practice style serves to define norms of clinical behavior. Differing practice styles may be based on training, personal experience, and medical evidence. Beliefs about effectiveness of different therapies and preferred patterns of diagnostic test use are examples of different facets of a practice style. For example, cardiologists evaluating patients with lower risk chest pain symptoms often conceptualize their primary diagnostic objective as maximizing the detection of ischemia. For this reason, they may strongly favor stress imaging.
Internists caring for the same patients may be more comfortable with initial use of exercise electrocardiogram (ECG) testing without imaging. This latter practice style focuses less on ischemia detection and more on following guideline recommendations that indicate no outcome advantage for stress imaging in this context. Cardiologists, relative to general internists, may also favor a more liberal use of coronary angiography and revascularization in patients with stable ischemic symptoms, i.e., the “oculostenotic reflex.”

Beyond the patient’s welfare, physician perceptions about the risk of a malpractice suit resulting from either an erroneous decision or a bad outcome may drive clinical decisions and create a practice referred to as defensive medicine. This practice involves ordering tests and therapies with very small marginal benefits, ostensibly to preclude future criticism should an adverse outcome occur. Over time, such patterns of care may become accepted as part of the practice norm, thereby perpetuating their overuse, e.g., annual cardiac exercise testing in asymptomatic patients.

Practice Setting  Factors in this category relate to work systems including tasks and workflow (e.g., interruptions, inefficiencies, workload), technology (e.g., electronic health record design or implementation issues), organizational characteristics (e.g., culture, leadership, staffing, scheduling), and the physical environment (e.g., noise, lighting, layout). Physician-induced demand is a term that refers to the repeated observation that once medical facilities and technologies become available to physicians, they will find ways to use them. Other environmental factors that can influence decision-making include the local availability of specialists for consultations and procedures; “high-tech” advanced imaging or procedure facilities such as magnetic resonance imaging (MRI) machines and proton beam therapy centers; and fragmentation of care.

Payment Systems  Economic incentives are closely related to the other two categories of practice-modifying factors. Financial issues can exert both stimulatory and inhibitory influences on clinical practice. Historically, physicians have been paid on a fee-for-service, capitation, or salary basis. In fee-for-service, physicians who do more generally get paid more, thereby encouraging overuse, consciously or unconsciously. When fees are reduced (discounted reimbursement), clinicians tend to increase the number of services provided to maintain revenue. Capitation, in contrast, provides a fixed payment per patient per year to encourage physicians to consider a global population budget in managing individual patients and ideally reducing the use of interventions with small marginal benefit.
In recognition of the unsustainability of continued growth in medical expenditures and the opportunity costs associated with that growth (funds that might be more beneficially applied to education, energy, social welfare, or defense), current efforts seek to transition to a value-based payment system to reduce overuse and to reflect benefit. Work to define how to tie payment to value has so far focused mostly on "pay for performance" models. High-quality clinical trial evidence for the effectiveness of these models is still mostly lacking.

■ ■DIAGNOSTIC TEST PERFORMANCE: UNDERSTANDING TEST ACCURACY
The purpose of performing a test on a patient is to reduce uncertainty about the patient's diagnosis or prognosis to facilitate appropriate management. Although diagnostic tests commonly refer to laboratory (e.g., blood count) or imaging tests or procedures (e.g., colonoscopy or bronchoscopy), any information that changes a clinician's understanding of the patient's problem qualifies as a diagnostic test. Thus, even the history and physical examination can be considered diagnostic tests. In clinical medicine, it is common to reduce the results of a test to a dichotomous outcome, such as positive or negative, normal or abnormal. Although this simplification often suppresses useful information (such as the degree of abnormality), it facilitates illustrating some important principles of test interpretation that are described below.

The accuracy of any diagnostic test is best assessed relative to a "gold standard," where a positive gold standard test defines the patients who have disease and a negative test securely rules out disease (Table 4-1). Characterizing the diagnostic performance of a new test requires identifying an appropriate population (ideally, patients representative of those in whom the new test would be used) and applying both the new and the gold standard tests to all subjects.
Biased estimates of test performance occur when diagnostic accuracy is defined using an inappropriate population or one in which gold standard determination of disease status is incomplete. The accuracy of the new test in distinguishing disease from health is determined relative to the gold standard results and summarized in four estimates. The sensitivity or true-positive rate reflects how well the new test identifies patients with disease. It is the proportion of patients with disease (defined by the gold standard) who have a positive test. The proportion of patients with disease who have a negative test is the false-negative rate, calculated as

1 – sensitivity.

TABLE 4-1  Measures of Diagnostic Test Accuracy

                     DISEASE STATUS
TEST RESULT    PRESENT                 ABSENT
Positive       True positives (TP)     False positives (FP)
Negative       False negatives (FN)    True negatives (TN)

Test Characteristics in Patients with Disease
True-positive rate (sensitivity) = TP/(TP + FN)
False-negative rate = FN/(TP + FN) = 1 – true-positive rate

Test Characteristics in Patients without Disease
True-negative rate (specificity) = TN/(TN + FP)
False-positive rate = FP/(TN + FP) = 1 – true-negative rate

The specificity, or true-negative rate, reflects how well the new test correctly identifies patients without disease. It is the proportion of patients without disease (defined by the gold standard) who have a negative test. The proportion of patients without disease who have a positive test is the false-positive rate, calculated as 1 – specificity. A theoretically perfect test would have a sensitivity of 100% and a specificity of 100% and would completely distinguish patients with disease from those without it. A useful mnemonic for the somewhat paradoxical relationship between what a test does best technically and what it is most useful for clinically is: a test with a very high sensitivity (Sn), when negative (N), helps rule out (out) disease (SnNout), and a test with a very high specificity (Sp), when positive (P), helps rule in (in) disease (SpPin).

Calculating sensitivity and specificity requires selection of a threshold value or cut point above which the test is considered "positive." Making the cut point "stricter" (e.g., raising it) lowers sensitivity but improves specificity, while making it "laxer" (e.g., lowering it) raises sensitivity but lowers specificity. This dynamic trade-off between more accurate identification of patients with disease versus those without disease is often displayed graphically as a receiver operating characteristic (ROC) curve (Fig.
4-1) by plotting sensitivity (y axis) versus 1 – specificity (x axis). Each point on the curve represents a potential cut point with an associated sensitivity and specificity value. The area under the ROC curve often is used as a quantitative measure of the information content of a test. Values range from 0.5 (no diagnostic information from testing at all; the test is equivalent to flipping a coin) to 1.0 (perfect test). The choice of cut point should ideally reflect the relative harms and benefits of treatment for those without versus those with disease. For example, if treatment is safe with substantial benefit, then choosing a high-sensitivity cut point (upper right of the ROC curve) for a low-risk test may be appropriate (e.g., phenylketonuria in newborns), but if treatment carries substantial risk of harm, then choosing a high-specificity cut point (lower left of the ROC curve) may be appropriate (e.g., chemotherapy for cancer). The choice of cut point may also depend on the prevalence of disease and on which error matters more in context: the harms of false-positive tests (e.g., HIV testing in marriage applicants) or the harms of false-negative tests (e.g., HIV testing in blood donors).

■ ■MEASURES OF DISEASE PROBABILITY AND BAYES' RULE
In the absence of perfect tests, the true disease state of the patient remains uncertain after every test. Bayes' rule provides a way to quantify the revised uncertainty using simple probability mathematics (and thereby avoid anchoring bias). It calculates the posttest probability, or likelihood of disease after a test result, from three parameters: the pretest probability of disease, the test sensitivity, and the test specificity. The pretest probability is a quantitative estimate of the likelihood of the diagnosis before the test is performed and is usually estimated from the prevalence of the disease in the underlying population (if known) or clinical context (e.g., age, sex, and type of chest pain).
For some common conditions, such as CAD, existing nomograms and statistical models generate estimates of pretest probability that account for history, physical examination, and test findings. The posttest probability

FIGURE 4-1  Each receiver operating characteristic (ROC) curve illustrates a tradeoff that occurs between improved test sensitivity (accurate detection of patients with disease) and improved test specificity (accurate detection of patients without disease), as the test value defining when the test turns from "negative" to "positive" is varied. A 45° line would indicate a test with no predictive value (sensitivity = specificity at every test value). The area under each ROC curve is a measure of the information content of the test. Thus, a larger ROC area signifies increased diagnostic accuracy.

(also called the predictive value of the test, see below) is a recalibrated statement of the probability of the diagnosis, accounting for both pretest probability and test results. For the probability of disease following a positive test (i.e., positive predictive value), Bayes' rule is calculated as:

Posttest probability = (Pretest probability × test sensitivity) /
    [Pretest probability × test sensitivity + (1 – Pretest probability) × (false-positive rate)]

For example, consider a 64-year-old woman with atypical chest pain who has a pretest probability of 0.50 and a "positive" diagnostic test result (assuming test sensitivity = 0.90 and specificity = 0.90):

Posttest probability = (0.50)(0.90) / [(0.50)(0.90) + (0.50)(0.10)] = 0.90
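In code, Bayes' rule for a positive test is a one-line calculation. A minimal sketch in plain Python (function and variable names are mine, for illustration), reproducing the worked example above:

```python
def posttest_probability(pretest, sensitivity, specificity):
    """Probability of disease after a positive test, by Bayes' rule.

    Numerator: true positives (pretest x sensitivity).
    Denominator: all positive results, true and false.
    """
    true_pos = pretest * sensitivity
    false_pos = (1 - pretest) * (1 - specificity)  # false positives arise among the disease-free
    return true_pos / (true_pos + false_pos)

# The 64-year-old woman with atypical chest pain from the text:
# pretest probability 0.50, sensitivity 0.90, specificity 0.90
p = posttest_probability(0.50, 0.90, 0.90)
print(round(p, 2))  # -> 0.9
```

Note that the same function with a low pretest probability (e.g., 0.10) returns 0.50, which anticipates the screening examples discussed below.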

The term predictive value has often been used as a synonym for the posttest probability. Unfortunately, clinicians commonly misinterpret reported predictive values as intrinsic measures of test accuracy rather than calculated probabilities. Studies of diagnostic test performance compound the confusion by calculating predictive values from the same sample used to measure sensitivity and specificity. Such calculations are misleading unless the test is applied subsequently to populations with exactly the same disease prevalence. For these reasons, the more descriptive term, posttest probability following a positive or a negative test, is preferred over predictive value.

The nomogram version of Bayes' rule (Fig. 4-2) helps us to understand at a conceptual level how it estimates the posttest probability of disease. In this nomogram, the impact of the diagnostic test result is summarized by the likelihood ratio, which is defined as the ratio of the probability of a given test result (e.g., "positive" or "negative") in a

patient with disease to the probability of that result in a patient without disease, thereby providing a measure of how well the test distinguishes those with from those without disease.

FIGURE 4-2  Nomogram version of Bayes' theorem used to predict the posttest probability of disease (right-hand scale) using the pretest probability of disease (left-hand scale) and the likelihood ratio for a positive or a negative test (middle scale). See text for information on calculation of likelihood ratios. To use, place a straightedge connecting the pretest probability and the likelihood ratio and read off the posttest probability. The right-hand part of the figure illustrates the value of a positive exercise treadmill test (likelihood ratio 4, green line) and a positive exercise thallium single-photon emission CT perfusion study (likelihood ratio 9, broken brown line) in a patient with a pretest probability of coronary artery disease of 50%. (Adapted from Centre for Evidence-Based Medicine: Likelihood ratios. Available at http://www.cebm.net/likelihood-ratios/.)

The likelihood ratio for a positive test is calculated as the ratio of the true-positive rate to the false-positive rate (or sensitivity/[1 – specificity]). For example, a test with a sensitivity of 0.90 and a specificity of 0.90 has a likelihood ratio of 0.90/(1 – 0.90), or 9. Thus, for this hypothetical test, a "positive" result is 9 times more likely in a patient with the disease than in a patient without it. Most tests in medicine have likelihood ratios for a positive result between 1.5 and 20. Higher values are associated with tests that more substantially increase the posttest likelihood of disease. A very high likelihood ratio positive (>10) usually implies high specificity, so a positive high-specificity test helps "rule in" disease (the "SpPin" mnemonic introduced earlier).
If sensitivity is excellent but specificity is less so, the likelihood ratio positive will be reduced substantially (e.g., with a 90% sensitivity but a 55% specificity, the likelihood ratio positive is 2.0). The corresponding likelihood ratio for a negative test is the ratio of the false-negative rate to the true-negative rate (or [1 – sensitivity]/specificity). Lower likelihood ratio negative values more substantially lower the posttest likelihood of disease. A very low likelihood ratio negative (falling below 0.10) usually implies high sensitivity, so a negative high-sensitivity test helps "rule out" disease (the SnNout mnemonic). The hypothetical test considered above with a sensitivity of 0.9 and a specificity of 0.9 would have a likelihood ratio for a negative test result of (1 – 0.9)/0.9, or 0.11, meaning that a negative result

is about one-tenth as likely in patients with disease as in those without disease (or about 10 times more likely in those without disease than in those with disease).
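The likelihood-ratio arithmetic behind the nomogram can also be done directly with odds: posttest odds = pretest odds × likelihood ratio. A short sketch (helper names are mine) using the hypothetical 90%-sensitive, 90%-specific test and the Fig. 4-2 examples:

```python
def lr_positive(sensitivity, specificity):
    """LR+ = true-positive rate / false-positive rate."""
    return sensitivity / (1 - specificity)

def lr_negative(sensitivity, specificity):
    """LR- = false-negative rate / true-negative rate."""
    return (1 - sensitivity) / specificity

def update(pretest_prob, likelihood_ratio):
    """Odds form of Bayes' rule: posttest odds = pretest odds x LR."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

lrp = lr_positive(0.90, 0.90)   # 9.0
lrn = lr_negative(0.90, 0.90)   # ~0.11: a negative result is about one-tenth as likely with disease
# Figure 4-2 examples at a 50% pretest probability of CAD:
p_treadmill = update(0.50, 4)   # positive treadmill (LR 4) -> 0.80
p_spect = update(0.50, 9)       # positive SPECT (LR 9) -> 0.90
```

The odds form makes the nomogram's straightedge operation explicit: the middle scale simply multiplies the pretest odds.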


■ ■APPLICATIONS TO DIAGNOSTIC TESTING IN CAD
Consider two tests commonly used in the diagnosis of CAD: an exercise treadmill and an exercise single-photon emission CT (SPECT) myocardial perfusion imaging test (Chap. 248). A positive treadmill ST-segment response has an average sensitivity of ~60% and an average specificity of ~75%, yielding a likelihood ratio positive of 2.4 (0.60/[1 – 0.75]) (consistent with modest discriminatory ability because it falls between 2 and 5). A 41-year-old man with atypical chest pain and no other risk factors has about a 10% pretest probability of CAD. After a positive result, the posttest probability of disease rises to only ~30%. For a 60-year-old man with typical angina and multiple risk factors, the pretest probability of CAD is about 80%. After a positive test result, the posttest probability of disease rises to ~95%.


In contrast, the exercise SPECT myocardial perfusion test is more accurate for diagnosis of CAD. For simplicity, assume that the finding of a reversible exercise-induced perfusion defect has both a sensitivity and a specificity of 90% (a bit higher than reported), yielding a likelihood ratio for a positive test of 9.0 (0.90/[1 – 0.90]) (consistent with intermediate discriminatory ability because it falls between 5 and 10). For the same 10% pretest probability patient, a positive test raises the probability of CAD to 50% (Fig. 4-2). However, despite the difference in posttest probabilities between these two tests (30 vs 50%), the more accurate test may not improve diagnostic likelihood enough to change patient management (e.g., the decision to refer to cardiac catheterization), because it has only moved the physician from being fairly certain that the patient did not have CAD to a 50:50 chance of disease. In a patient with a pretest probability of 80%, the exercise SPECT test raises the posttest probability to 97% (compared with 95% for the exercise treadmill). Again, the more accurate test does not provide enough improvement in posttest confidence to alter management, and neither test has improved much on what was known from clinical data alone.

In general, positive results with an accurate test (e.g., likelihood ratio for a positive test of 10) when the pretest probability is low (e.g., 20%) do not move the posttest probability to a range high enough to rule in disease (e.g., 80%). In screening situations, pretest probabilities are often particularly low because patients are asymptomatic. In such cases, specificity becomes especially important. For example, in screening first-time female blood donors without risk factors for HIV, a positive test raised the likelihood of HIV to only 67% despite a specificity of 99.995% because the prevalence was 0.01%.
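The blood-donor example shows how prevalence dominates screening: even near-perfect specificity produces many false positives when disease is rare. A quick check of the arithmetic (the text does not state a sensitivity, so 100% is assumed here for illustration):

```python
prevalence = 0.0001     # 0.01% of first-time female blood donors
sensitivity = 1.0       # assumption for illustration; not stated in the text
specificity = 0.99995

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)  # posttest probability after a positive screen
print(round(ppv, 2))  # -> 0.67
```

Even at 99.995% specificity, false positives (about 5 per 100,000 disease-free donors) are roughly half as numerous as true positives, capping the posttest probability at about two-thirds.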
Conversely, with a high pretest probability, a negative test may not rule out disease adequately if it is not sufficiently sensitive. Thus, the largest change in diagnostic likelihood following a test result occurs when the clinician is most uncertain (i.e., pretest probability between 30 and 70%). For example, a 70-year-old woman with typical angina and multiple risk factors has a

pretest probability for CAD of ~50%. A positive exercise treadmill test moves the posttest probability to 80%, and a positive exercise SPECT perfusion test moves it to 90% (Fig. 4-2).
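Sequential testing can be sketched with the same odds arithmetic: the posttest probability after the first test becomes the pretest probability for the second. The sketch below (helper name is mine) uses the likelihood ratios from Fig. 4-2 and assumes the two results are conditionally independent given disease status, which, as discussed below, often does not hold in practice:

```python
def update(pretest_prob, likelihood_ratio):
    """Odds form of Bayes' rule for one test result."""
    odds = pretest_prob / (1 - pretest_prob) * likelihood_ratio
    return odds / (1 + odds)

# 70-year-old woman with typical angina: pretest probability ~50%
p_after_treadmill = update(0.50, 4)          # positive treadmill (LR 4) -> 0.80
p_after_both = update(p_after_treadmill, 9)  # then positive SPECT (LR 9) -> ~0.97,
                                             # valid only if the tests are independent
```

Because the two tests probe related physiology (ischemia), the second result is correlated with the first, and the true combined probability is likely lower than this naive chained update suggests.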

As presented above, Bayes' rule employs a number of important simplifications that should be considered. First, few tests provide only "positive" or "negative" results. Many tests have multidimensional outcomes (e.g., extent of ST-segment depression, exercise duration, and exercise-induced symptoms with exercise testing). Although Bayes' theorem can be adapted to this more detailed test result format, it is computationally more complex to do so. Similarly, when multiple sequential tests are performed, the posttest probability after the first test may be used as the pretest probability to interpret the second test. However, this simplification assumes conditional independence (that is, that the results of the first test do not affect the likelihood of the second test result), and this is often not true.

Finally, many texts assert that sensitivity and specificity are prevalence-independent parameters of test accuracy. This statistically useful assumption, however, is often incorrect. A treadmill exercise test, for example, has a sensitivity of ~30% in a population of patients with one-vessel CAD, whereas its sensitivity in patients with severe three-vessel CAD approaches 80%. Thus, the best estimate of sensitivity to use in a particular decision may vary depending on the severity of disease in the local population. A hospitalized, symptomatic, or referral population typically has a higher prevalence of disease and, importantly, a higher prevalence of more advanced disease than does an outpatient population. Consequently, test sensitivity will likely be higher in hospitalized patients and test specificity higher in outpatients.

■ ■RISK PREDICTION MODELS
Bayes' rule, when used as presented above, is useful in studying diagnostic testing concepts, but predictions based on multivariable statistical models can more accurately address these more complex problems by simultaneously accounting for additional relevant patient characteristics.
These models explicitly account for multiple, even possibly overlapping, pieces of patient-specific information and assign a relative weight to each on the basis of its unique independent contribution to the prediction in question. For example, a logistic regression model to predict the probability of CAD ideally considers all the relevant independent factors from the clinical examination and diagnostic testing and their relative importance, instead of the limited data that clinicians can manage in their heads or with Bayes' rule. However, despite this strength, prediction models are usually too complex computationally to use without a calculator or computer. Guideline-driven treatment recommendations based on statistical prediction models available online, e.g., the American College of Cardiology/American Heart Association risk calculator for primary prevention with statins and the CHA2DS2-VASc calculator for anticoagulation in atrial fibrillation, have generated more widespread usage. Some predictive models are now embedded into electronic health record (EHR) systems, most commonly addressing issues related to thrombosis/anticoagulation and to sepsis. Evidence about the impact of these EHR-based models on patient outcomes is mostly observational and suggests that more work is needed to deliver the risk information to the right clinician at the right time in a way that supports clinical workflow. One reason for limited clinical use is that, to date, only a handful of prediction models have been validated sufficiently (e.g., the Wells criteria for pulmonary embolism; Table 4-2). The importance of independent validation in a population separate from the one used to develop the model cannot be overstated. An unvalidated risk prediction model should be viewed with the skepticism appropriate for any new drug or medical device that has not had rigorous clinical trial testing.
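The functional form behind most such risk models is simple even when the fitting is not. A logistic regression model combines weighted predictors into a probability; the sketch below uses entirely hypothetical coefficients and predictors, for illustration only:

```python
import math

def logistic_probability(intercept, coefficients, values):
    """Predicted probability from a logistic regression risk model:
    p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and predictors, not from any published model;
# real models (e.g., the ACC/AHA risk calculator) are fit to large cohorts
# and require independent validation before clinical use.
risk = logistic_probability(
    intercept=-4.0,
    coefficients=[0.04, 0.9, 0.7],  # per year of age, diabetes, smoking
    values=[65, 1, 1],              # a 65-year-old with diabetes who smokes
)
```

The model's "relative weight" for each factor is its coefficient; the computational burden the text mentions comes from the number of terms and the fitting, not from this final arithmetic.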
When statistical survival models in cancer and heart disease have been compared directly with clinicians' predictions, the survival models have been found to be more consistent, as would be expected, but not always more accurate. On the other hand, comparison of clinicians with websites and apps that generate lists of possible diagnoses to help patients with self-diagnosis found that physicians outperformed the currently available programs. For students and less-experienced clinicians, the biggest value of diagnostic decision support may be in

extending diagnostic possibilities and triggering "rational override," but their impact on knowledge, information-seeking, and problem-solving needs additional research.

TABLE 4-2  Wells Clinical Prediction Rule for Pulmonary Embolism (PE)

CLINICAL FEATURE                                           POINTS
Clinical signs of deep-vein thrombosis                     3.0
Alternative diagnosis is less likely than PE               3.0
Heart rate >100 beats/min                                  1.5
Immobilization ≥3 days or surgery in previous 4 weeks      1.5
History of deep-vein thrombosis or PE                      1.5
Hemoptysis                                                 1.0
Malignancy (with treatment within 6 months) or palliative  1.0

INTERPRETATION
Score >6.0      High
Score 2.0–6.0   Intermediate
Score <2.0      Low

FORMAL DECISION SUPPORT TOOLS

■ ■DECISION SUPPORT SYSTEMS AND ARTIFICIAL INTELLIGENCE
Over the past 50 years, many attempts have been made to develop computer systems to aid clinical decision-making and patient management. Conceptually, computers offer several levels of potentially useful support for clinicians. At the most basic level, they provide ready access to vast reservoirs of information, which may, however, be quite difficult to sort through to find what is needed. At higher levels, computers can support care management decisions by making accurate predictions of outcome, or can simulate the whole decision process and provide algorithmic guidance. Computer-based predictions using Bayesian or statistical regression models inform a clinical decision but do not actually reach a "conclusion" or "recommendation." Recent advances in artificial intelligence (AI) suggest that medicine is on the threshold of developing much more powerful digital tools, but current enthusiasm for such tools still exceeds demonstrated utility in clinical care. Work on AI dates back to the 1950s and can be separated into three major subtypes: neural networks, machine learning (and its subtype deep learning), and generative AI. Machine learning methods are being applied to pattern recognition tasks such as the examination of skin lesions and the interpretation of x-rays. Generative AI (AI models that generate new content) is a term that covers several different types of systems, including large language models (which generate language-based content; an example is GPT-4). Large language models offer some promise in helping to create clinical notes.
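Simple validated prediction rules such as the Wells score (Table 4-2) illustrate the kind of algorithmic logic such systems encode. A minimal sketch (function and argument names are mine; point values follow the published rule):

```python
def wells_pe_score(signs_of_dvt=False, pe_most_likely=False,
                   heart_rate_over_100=False, recent_immobilization=False,
                   prior_dvt_or_pe=False, hemoptysis=False, malignancy=False):
    """Wells clinical prediction rule for pulmonary embolism (Table 4-2)."""
    score = (3.0 * signs_of_dvt
             + 3.0 * pe_most_likely
             + 1.5 * heart_rate_over_100
             + 1.5 * recent_immobilization
             + 1.5 * prior_dvt_or_pe
             + 1.0 * hemoptysis
             + 1.0 * malignancy)
    if score > 6.0:
        risk = "high"
    elif score >= 2.0:
        risk = "intermediate"
    else:
        risk = "low"
    return score, risk

# Tachycardic patient with leg swelling and no better alternative diagnosis
score, risk = wells_pe_score(signs_of_dvt=True, pe_most_likely=True,
                             heart_rate_over_100=True)
# -> (7.5, "high")
```

Encoding a rule this way is trivial; the hard and essential work, as the text emphasizes, is the independent validation that earned the Wells rule its place in practice.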
The use of large language models in support of clinical decision-making, however, is still at a very preliminary stage, with a need for independent validation in a population separate from the one used to develop the model. Early evidence suggests that clinicians are willing to rely on AI-based tools even when the information provided is clearly inaccurate or contradictory. Concerns about model confabulation and the potential for patient harm mandate careful and comprehensive testing before AI tools are implemented in clinical care.

Reminder or protocol-directed systems do not make predictions but use existing algorithms, such as guidelines or appropriate utilization criteria, to direct clinical practice. In general, however, decision support systems have so far had little impact on practice. Reminder systems built into EHRs have shown the most promise, particularly in correcting drug dosing and promoting adherence to guidelines. Checklists may also help avoid or reduce errors.

■ ■DECISION ANALYSIS
Compared with the decision support methods discussed earlier, decision analysis represents a normative, prescriptive approach to decision-making in the face of uncertainty. Its principal application

is in complex decisions. For example, public health policy decisions often involve trade-offs in length versus quality of life, benefits versus resource use, and population versus individual health, along with uncertainty regarding efficacy, effectiveness, and adverse events, as well as values or preferences regarding mortality and morbidity outcomes. One recent analysis using this approach involved the optimal screening strategy for breast cancer, which has remained controversial, in part because a randomized controlled trial to determine when to begin screening and how often to repeat screening mammography is impractical. In 2016, the National Cancer Institute–sponsored Cancer Intervention and Surveillance Network (CISNET) examined eight strategies differing by whether to initiate mammography screening at age 40, 45, or 50 years and whether to screen annually, biennially, or annually for women in their forties and biennially thereafter (hybrid). The six simulation models found biennial strategies to be the most efficient for average-risk women. Biennial screening for 1000 women from age 50–74 years versus no screening avoided seven breast cancer deaths. Screening annually from age 40–74 years avoided three additional deaths but required 20,000 additional mammograms and yielded 1988 more false-positive results. Factors that influenced the results included a two- to fourfold higher risk for developing breast cancer, in which case annual screening from age 40–74 years yielded benefits similar to those of biennial screening from age 50–74 in average-risk women. For average-risk patients with moderate or severe comorbidities, screening could be stopped earlier, at age 66–68 years. This analysis involved six models that reproduced epidemiologic trends and a screening trial result, accounted for digital technology and treatment advances, and considered quality of life, risk factors, breast density, and comorbidity.
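The incremental trade-off reported by CISNET can be made concrete with simple arithmetic (figures per 1000 women, taken from the summary above):

```python
# Annual screening from age 40-74 vs biennial from age 50-74, per 1000 women
extra_deaths_averted = 3
extra_mammograms = 20_000
extra_false_positives = 1988

mammograms_per_death_averted = extra_mammograms / extra_deaths_averted
false_positives_per_death_averted = extra_false_positives / extra_deaths_averted
# Roughly 6,667 additional mammograms and ~663 additional false-positive
# results for each additional breast cancer death averted.
```

Making the denominators explicit in this way is exactly the kind of trade-off accounting that decision analysis formalizes.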
It provided novel insights into a public health problem in the absence of randomized clinical trials examining alternative start ages, stop ages, and screening frequencies, and helped weigh the pros and cons of such a health policy recommendation. Although such models have been developed for selected clinical problems, their benefit and application to individual real-time clinical management have yet to be demonstrated.

DIAGNOSIS AS AN ELEMENT OF QUALITY OF CARE
High-quality medical care begins with accurate diagnosis. The incidence of diagnostic errors has been estimated by a variety of methods, including postmortem examinations, medical record reviews, and medical malpractice claims, each yielding complementary but different estimates of this quality-of-care and patient-safety problem. In the past, diagnostic errors tended to be viewed as failures of individual clinicians. The modern view is that they mostly reflect deficiencies in the system of care. Current estimates suggest that nearly everyone will experience at least one diagnostic error in their lifetime, leading to mortality, morbidity, unnecessary tests and procedures, costs, and anxiety.

Solutions to the "diagnostic errors as a system of care" problem have focused on system-level approaches, such as decision support and other tools integrated into EHRs. The use of checklists has been proposed as a means of reducing some of the cognitive errors discussed earlier in the chapter, such as premature closure. While checklists have been shown to be useful in certain medical contexts, such as operating rooms and intensive care units, their value in preventing diagnostic errors that lead to patient adverse events remains to be shown.

EVIDENCE-BASED MEDICINE
Clinical medicine is defined traditionally as a practice combining medical knowledge (including scientific evidence), intuition, and judgment in the care of patients (Chap. 1). EBM updates this construct by placing much greater emphasis on the processes by which clinicians gain knowledge of the most up-to-date and relevant clinical research to determine for themselves whether medical interventions alter the disease course and improve the length or quality of life. The phrase "evidence-based medicine" is now used so often and in so many different contexts that many practitioners are unaware of its original meaning. The intention of the EBM program, as described in the early 1990s

by its founding proponents at McMaster University, becomes clearer through an examination of its four key steps:

  1. Formulating the management question to be answered
  2. Searching the literature and online databases for applicable research data
  3. Appraising the evidence gathered with regard to its validity and relevance
  4. Integrating this appraisal with knowledge about the unique aspects of the patient (including the patient's preferences about the possible outcomes)

The process of searching the world's research literature and appraising the quality and relevance of studies can be time-consuming and requires skills and training that most clinicians do not possess. In a busy clinical practice, the work required is also logistically not feasible. This has led to a focus on finding recent systematic overviews of the problem in question as a useful shortcut in the EBM process. Systematic reviews are regarded by some as the highest level of evidence in the EBM hierarchy because they are intended to comprehensively summarize the available evidence on a particular topic. To avoid the potential biases found in narrative review articles, predefined, reproducible, explicit search strategies and inclusion and exclusion criteria seek to find all of the relevant scientific research and grade its quality. The prototype for this kind of resource is the Cochrane Database of Systematic Reviews. When appropriate, a meta-analysis is used to quantitatively summarize the systematic review findings (discussed further below).

Unfortunately, systematic reviews are not uniformly the acme of the EBM process they were initially envisioned to be. In select circumstances, they can provide a much clearer picture of the state of the evidence than is available from any individual clinical report, but their value is less clear when only a few trials are available, when trials and observational studies are mixed, or when the evidence base is only observational. They cannot compensate for deficiencies in the underlying research available, and many are created without the requisite clinical insights. The medical literature is now flooded with systematic reviews of varying quality and clinical utility.
The peer review system has, unfortunately, not proved to be an effective arbiter of the quality of these papers. Therefore, systematic reviews should be used with circumspection, in conjunction with selective reading of some of the best empirical studies.

■ ■SOURCES OF EVIDENCE: CLINICAL TRIALS AND REGISTRIES
The notion of learning from observation of patients is as old as medicine itself. Over the past 50 years, physicians' understanding of how best to turn raw observation into useful evidence has evolved considerably. Medicine received a hard refresher lesson in this process from the COVID-19 pandemic. Starting in the spring of 2020, case reports, personal and institutional anecdotal experience, and small single-center case series started appearing in the peer-reviewed literature and within months turned into a flood of confusing and often contradictory evidence. Observational reports of treatments for COVID-19 fueled the confusion. Despite >40,000 publications appearing in the first 7 months of the pandemic, an enormous amount of uncertainty around prevention, diagnosis, treatment, and prognosis of the disease remained. Many of the early 2020 publications were either small observational series or reviews of published series, neither of which can resolve the key uncertainties clinicians need to address in caring for these patients. These small observational studies often have substantial limitations in validity and generalizability, and although they may generate important hypotheses or be the first reports of adverse events or therapeutic benefit, they have no role in formulating modern standards of practice. The major tools used to develop reliable evidence consist of randomized clinical trials supplemented strategically by large (high-quality) observational registries. A registry or database typically is focused on a disease or syndrome (e.g., different types of cancer, acute or chronic CAD, pacemaker capture, or chronic heart failure), a clinical procedure (e.g., bone marrow transplantation, coronary

revascularization), or an administrative process (e.g., claims data used for billing and reimbursement).
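As a concrete, entirely hypothetical illustration, one entry in such a registry might be modeled as a small structured record; the disease focus, field names, and values below are invented for illustration and do not correspond to any actual registry:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class HeartFailureRegistryRecord:
    """One patient entry in a hypothetical chronic heart failure registry."""
    patient_id: str
    enrollment_date: date
    nyha_class: int               # NYHA functional class, I-IV (recorded as 1-4)
    ejection_fraction: float      # left ventricular EF, percent
    on_beta_blocker: bool         # therapy as actually delivered, not assigned
    date_of_death: Optional[date] = None  # None while the patient is alive

# Example entry as a clinician might have recorded it in routine care
record = HeartFailureRegistryRecord(
    patient_id="HF-0001",
    enrollment_date=date(2023, 3, 15),
    nyha_class=2,
    ejection_fraction=35.0,
    on_beta_blocker=True,
)
```

The key point the sketch makes explicit is that every field documents care as it was actually delivered; nothing in the record is assigned by an investigator.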

By definition, in observational data, the investigator does not control patient care. Carefully collected prospective observational data, however, can at times achieve a level of evidence quality approaching that of major clinical trial data through trial emulation (specifying eligibility criteria, interventions, outcome, follow-up, causal contrast, and statistical analysis) using causal inference methods. At the other end of the spectrum, data collected retrospectively (e.g., chart review) are limited in form and content to what previous observers recorded and may not include the specific research data being sought (e.g., claims data). Advantages of observational data include the inclusion of a broader population, as encountered in practice, than is typically represented in clinical trials with their restrictive inclusion and exclusion criteria. In addition, observational data provide primary evidence for research questions when a randomized trial cannot be performed. For example, it would be difficult to randomize patients to test diagnostic or therapeutic strategies that are unproven but widely accepted in practice, and it would be unethical to randomize based on sex, racial/ethnic group, socioeconomic status, or country of residence or to randomize patients to a potentially harmful intervention, such as smoking or deliberately overeating to develop obesity.

A well-done prospective observational study of a particular management strategy differs from a well-done randomized clinical trial most importantly by its lack of protection from treatment selection bias. The use of observational data to compare diagnostic or therapeutic strategies assumes that sufficient uncertainty and heterogeneity exist in clinical practice to ensure that similar patients will be managed differently by diverse physicians.
In short, the analysis assumes that a sufficient element of randomness (in the sense of disorder rather than in the formal statistical sense) exists in clinical management. In such cases, statistical models attempt to adjust for important imbalances to "level the playing field" so that a fair comparison among treatment options can be made. When management is clearly not random (e.g., all eligible left main CAD patients are referred for coronary bypass surgery), the problem may be too confounded (biased) for statistical correction, and observational data may not provide reliable evidence. In general, the use of concurrent controls is vastly preferable to that of historical controls. For example, comparison of current surgical management of left main CAD with medically treated patients with left main CAD during the 1970s (the last time these patients were routinely treated with medicine alone) would be extremely misleading because "medical therapy" has substantially improved in the interim.

Randomized controlled clinical trials include the careful prospective design features of the best observational data studies but also include the use of random allocation of treatment. This design provides the best protection against measured and unmeasured confounding due to treatment selection bias (a major aspect of internal validity). However, the randomized trial may not have good external validity (generalizability) if the process of recruitment into the trial resulted in the exclusion of many potentially eligible subjects or if the nominal eligibility for the trial describes a very heterogeneous population. Consumers of medical evidence need to be aware that randomized trials vary widely in their quality and applicability to practice. The process of designing such a trial often involves many compromises. For example, trials designed to gain U.S. Food and Drug Administration (FDA) approval for an investigational drug or device must fulfill regulatory requirements (such as the use of a placebo control) that may result in a trial population and design that differ substantially from what practicing clinicians would find most useful.

■ ■META-ANALYSIS

The Greek prefix meta signifies something at a later or higher stage of development. Meta-analysis is research that combines and summarizes the available evidence quantitatively. Although it is used to examine nonrandomized studies, meta-analysis is most useful for summarizing all available randomized trials examining a particular therapy used in a specific clinical context. Ideally, unpublished trials
should be identified and included to avoid publication bias (i.e., missing "negative" trials that may not be published). Furthermore, the best meta-analyses obtain and analyze individual patient-level data from all trials rather than using only the summary data from published reports. Nonetheless, not all published meta-analyses yield reliable evidence for a particular problem, so their methodology should be scrutinized carefully to ensure proper study design and analysis. The results of a well-done meta-analysis are likely to be most persuasive if they include at least several large-scale, properly performed randomized trials. Meta-analysis can especially help detect benefits when individual trials are inadequately powered (e.g., the benefits of streptokinase thrombolytic therapy in acute MI, demonstrated by ISIS-2 in 1988, were already evident by the early 1970s through meta-analysis). However, in cases in which the available trials are small or poorly done, meta-analysis should not be viewed as a remedy for deficiencies in primary trial data or trial design.

Meta-analyses typically focus on summary measures of relative treatment benefit, such as odds ratios or relative risks. Clinicians should also examine what absolute risk reduction (ARR) can be expected from the therapy. A frequently reported metric of absolute treatment benefit is the number needed to treat (NNT) to prevent one adverse outcome event (e.g., death, stroke). NNT should not be interpreted literally as a causal statement; it is simply 1/ARR. For example, if a hypothetical therapy reduced mortality rates over a 5-year follow-up by 33% (the relative treatment benefit), from 12% (control arm) to 8% (treatment arm), the ARR would be 12% − 8% = 4%, and the NNT would be 1/0.04, or 25. This does not mean literally that 1 patient benefits and 24 do not. However, it can be conceptualized as an informal measure of treatment efficiency.
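The NNT arithmetic above is a one-line computation; a minimal sketch (the helper function name is illustrative, not from the text):

```python
def nnt(control_rate: float, treated_rate: float) -> float:
    """Number needed to treat = 1 / absolute risk reduction (ARR)."""
    arr = control_rate - treated_rate  # absolute risk reduction
    if arr <= 0:
        raise ValueError("treatment shows no absolute benefit")
    return 1.0 / arr

# The worked example from the text: 12% control vs. 8% treatment mortality
print(round(nnt(0.12, 0.08)))  # prints 25: treat ~25 patients to prevent one death
```

Because the function takes the control-arm rate as an input, it also makes explicit that the NNT depends on baseline risk, not just on the relative benefit.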
If the hypothetical treatment were applied to a lower-risk population, say, one with a 6% 5-year mortality, the 33% relative treatment benefit would reduce absolute mortality
by 2% (from 6% to 4%), and the NNT for the same therapy in this lower-risk group of patients would be 50. Although often not made explicit, comparisons of NNT estimates from different studies should account for the duration of follow-up used to create each estimate. In addition, the NNT concept assumes a homogeneity in response to treatment that may not be accurate. The NNT is simply another way of summarizing the absolute treatment difference and does not provide any unique information.

■ ■CLINICAL PRACTICE GUIDELINES

Per the 1990 Institute of Medicine definition, clinical practice guidelines are "systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances." This definition emphasizes several crucial features of modern guideline development. First, guidelines are created by using the tools of EBM. In particular, the core of the development process is a systematic literature search followed by a review of the relevant peer-reviewed literature. Second, guidelines usually are focused on a clinical disorder (e.g., diabetes mellitus, stable angina pectoris) or a health care intervention (e.g., cancer screening). Third, the primary objective of guidelines is to improve the quality of medical care by identifying care practices that should be routinely implemented, based on high-quality evidence and high benefit-to-harm ratios for the interventions. Guidelines are intended to "assist" decision-making, not to define explicitly what decisions should be made in a particular situation, in part because guideline-level evidence alone is never sufficient for clinical decision-making (e.g., deciding whether to intubate and administer antibiotics for pneumonia in a terminally ill individual, in an individual with dementia, or in an otherwise healthy 30-year-old mother).

Guidelines are narrative documents constructed by expert panels whose composition often is determined by interested professional organizations. These panels vary in expertise and in the degree to which they represent all relevant stakeholders. The guideline documents consist of a series of specific management recommendations, a summary indication of the quantity and quality of evidence supporting each recommendation, an assessment of the benefit-to-harm ratio for the recommendation, and a narrative discussion of the