# 04 - 4 Decision-Making in Clinical Medicine

## 4 Decision-Making in Clinical Medicine

PART 1
The Profession of Medicine
Daniel B. Mark, John B. Wong

Practicing medicine at its core requires making decisions. What makes 
medical practice so difficult is not only the specialized technical 
knowledge required but also the intrinsic uncertainty that surrounds 
each decision. Mastering the technical aspects of medicine alone, 
unfortunately, does not ensure a mastery of the practice of medicine. 

Sir William Osler’s familiar quote “Medicine is a science of uncertainty
and an art of probability” captures well this complex duality.
Although the science of medicine is often taught as if the mechanisms 
of the human body operate with Newtonian predictability, every 
aspect of medical practice is infused with an element of irreducible 
uncertainty that the clinician ignores at their peril. Although deeply 
rooted in science, more than 100 years after the practice of medicine 
took its modern form, it remains at its core a craft, to which individual 
doctors bring varying levels of skill, knowledge, and understanding. 
With the exponential growth in medical literature and other technical
information and an ever-increasing number of testing and treatment
options, twenty-first century physicians who seek excellence in their 
craft must master a more diverse and complex set of skills than any of 
the generations that preceded them. This chapter introduces three of 
the pillars upon which the craft of modern medicine rests: (1) expertise 
in clinical reasoning (what it is and how it can be developed); (2) rational 
diagnostic test use and interpretation; and (3) integration of the best 
available research evidence with clinical judgment in the care of individual
patients (evidence-based medicine [EBM]).

### BRIEF INTRODUCTION TO CLINICAL REASONING

#### Clinical Expertise

Defining “clinical expertise” remains surprisingly difficult. Chess has an objective ranking system based on skill
and performance criteria. Athletics, similarly, have ranking systems to 
distinguish novices from Olympians. But in medicine, after physicians 
complete training and pass the boards (or get recertified), no tests or 
benchmarks are used to identify those who have attained the highest 
levels of clinical performance. At each institution, there are often a few 
“elite” clinicians who are known for their “special problem-solving 
prowess” when particularly difficult or obscure cases have baffled 
everyone else. Yet despite their skill, even such master clinicians typically
cannot explain their exact processes and methods, thereby limiting
the acquisition and dissemination of the expertise used to achieve
their impressive results. Furthermore, clinical virtuosity appears not to 
be generalizable, e.g., an expert on hypertrophic cardiomyopathy may 
be no better (and possibly worse) than a first-year medical resident 
at diagnosing and managing a patient with neutropenia, fever, and 
hypotension.
Broadly construed, clinical expertise encompasses not only cognitive 
dimensions involving the integration of disease knowledge with verbal 
and visual cues and test interpretation but also potentially the complex 
fine-motor skills necessary for invasive procedures and tests. In addition,
“the complete package” of expertise in medicine requires effective
communication and care coordination with patients and members of 
the medical team. Research on medical expertise remains sparse overall
and mostly centered on diagnostic reasoning, so this chapter focuses
primarily on the cognitive elements of clinical reasoning.
Objective study of the clinical reasoning process is difficult as it 
occurs in the heads of clinicians. One research approach asks clinicians 
to “think out loud” as they receive increments of clinical information 
in a manner meant to simulate a clinical encounter. Another research 
approach focuses on how doctors should reason diagnostically, to 
identify remediable “errors,” rather than on how they actually do 
reason. Much of what is known about clinical reasoning comes from 
empirical studies of nonmedical problem-solving behavior. Because of 
the diverse perspectives contributing to this area, with important contributions
from cognitive psychology, medical education, behavioral
economics, sociology, informatics, and decision sciences, no single 
integrated model of clinical reasoning exists, and not infrequently, different
terms and reasoning models describe similar phenomena.

#### Intuitive Versus Analytic Reasoning

A useful contemporary model of reasoning, the dual-process theory, distinguishes two
general conceptual modes of thinking as fast or slow. Intuition (System 1)
provides rapid effortless judgments from memorized associations
using pattern recognition and other simplifying “rules of thumb” (i.e., 
heuristics). For example, a very simple pattern that could be useful 
in certain situations is “black woman plus hilar adenopathy equals 
sarcoid.” Because no effort is involved in recalling the pattern, the 
clinician is often unable to say how those judgments were formulated. 
In contrast, analysis (System 2), the other form of reasoning in the 
dual-process model, is slow, methodical, deliberative, and effortful. A 
student might read about causes of hilar adenopathy and from that list 
(e.g., Chap. 70), identify diseases more common in black women or 
examine the patient for skin or eye findings that occur with sarcoid. 
These dual processes, of course, represent two exemplars taken from 
the cognitive continuum. They provide helpful descriptive insights but 
very little guidance in how to develop expertise in clinical reasoning. 
How these idealized systems interact in different decision problems,
how experts use them differently from novices, and when their usage
can lead to errors in judgment remain the subject of study and
considerable debate.
Pattern recognition, an important part of System 1 reasoning, is 
a complex cognitive process that appears largely effortless. One can 
recognize people’s faces, the breed of a dog, an automobile model, or 
a piece of music from just a few notes within milliseconds without 
necessarily being able to articulate the specific features that prompted 
the recognition. Analogously, experienced clinicians often recognize 
familiar diagnostic patterns very quickly. The key here is having a large 
library of stored patterns that can be rapidly accessed. In the absence 
of an extensive stored repertoire of diagnostic patterns, students (as 
well as experienced clinicians operating outside their area of expertise 
and familiarity) often must use the more laborious System 2 analytic 
approach along with more intensive and comprehensive data collection 
to reach the diagnosis.
The following brief patient scenarios illustrate three distinct patterns
associated with hemoptysis that experienced clinicians recognize
without effort:
• A 46-year-old man presents to his internist with a chief complaint 
of hemoptysis. An otherwise healthy nonsmoker, he is recovering
from an apparent viral bronchitis. This presentation pattern suggests
that the small amount of blood-streaked sputum is due to acute
bronchitis, so that a chest x-ray provides sufficient reassurance that 
a more serious disorder is absent.
• In the second scenario, a 46-year-old patient who has the same chief 
complaint but with a 100-pack-year smoking history, a productive 
morning cough with blood-streaked sputum, and weight loss fits 
the pattern of carcinoma of the lung. Consequently, along with the 
chest x-ray, the clinician obtains a sputum cytology examination 
and refers this patient for a chest computed tomography (CT) scan.
• In the third scenario, the clinician hears a soft diastolic rumbling 
murmur at the apex on cardiac auscultation in a 46-year-old patient 
with hemoptysis who immigrated from a developing country and 
orders an echocardiogram as well, because of possible pulmonary 
hypertension from suspected rheumatic mitral stenosis.
Pattern recognition by itself is not, however, sufficient for secure 
diagnosis. Without deliberative systematic reflection, undisciplined 
pattern recognition can result in premature closure: mistakenly jumping
to the conclusion that one has the correct diagnosis before all the
relevant data are in. A critical second step, therefore, even when the 
diagnosis seems obvious, is diagnostic verification: considering whether 
the diagnosis adequately accounts for all of the presenting symptoms 
and signs and can explain all the ancillary findings. The following case 
based on a real clinical encounter provides an example of premature 
closure. A 45-year-old man presents with a 3-week history of a “flulike” 
upper respiratory infection (URI) including dyspnea and a productive 
cough. The emergency department (ED) clinician pulled out a “URI 
assessment form,” which defines and standardizes the information 
gathered. After quickly acquiring the requisite structured examination 
components and noting in particular the absence of fever and a clear 
chest examination, the physician prescribed a cough suppressant for 
acute bronchitis and reassured the patient that his illness was not serious.
Following a sleepless night at home with significant dyspnea, the
patient developed nausea and vomiting and collapsed. He was brought 
back to the ED in cardiac arrest and was unable to be resuscitated. 
His autopsy showed a posterior wall myocardial infarction (MI) and 
a fresh thrombus in an atherosclerotic right coronary artery. What 
went wrong? Presumably, the ED clinician felt that the patient was 
basically healthy (one can be misled by the way the patient appears on 
examination—a patient that does not appear “sick” may be incorrectly 
assumed to have an innocuous illness). So, in this case, the physician, 
upon hearing the overview of the patient from the triage nurse, elected 
to use the URI assessment protocol even before starting the history, 
closing consideration of the broader range of possibilities and associated
tests required to confirm or refute these possibilities. Specifically,
by concentrating on the abbreviated and focused URI protocol, the 
clinician failed to elicit the full dyspnea history, which was precipitated
by exertion and accompanied by chest heaviness and relieved by rest,
suggesting a far more serious disorder.

Heuristics or rules of thumb are a part of the intuitive system. These 
cognitive shortcuts provide a quick and easy path to reaching conclusions
and making choices, but when used improperly, they can lead
to errors. Two major research programs have studied heuristics in a 
mostly nonmedical context and have reached different conclusions 
about the value of these cognitive tools. The “heuristics and biases” 
program focuses on how these mental shortcuts can lead to incorrect 
judgments. So far, however, little evidence exists that educating physicians
and other decision makers to watch for the >100 cognitive biases
identified to date has had any effect on the rate of diagnostic errors. In 
contrast, the “fast and frugal heuristics” research program explores how 
and when relying on simple heuristics can produce good decisions. 
Although many heuristics have relevance to clinical reasoning, only 
four will be mentioned here.
When diagnosing patients, clinicians usually develop diagnostic 
hypotheses based on the similarity of that patient’s symptoms, signs, 
and other data to their mental representations (memorized patterns) 
of the disease possibilities. In other words, clinicians pattern match 
to identify the diagnoses that share the most similar findings to the 
patient at hand. This cognitive shortcut is called the representativeness 
heuristic. Consider a patient with hypertension who has headache, 
palpitations, and diaphoresis. Given this classic presenting symptom 
triad suggesting pheochromocytoma, clinicians might judge pheochromocytoma
to be quite likely based on the representativeness heuristic.
Doing so, however, would be incorrect given that other causes of 
hypertension are much more common than pheochromocytoma and 
this triad of symptoms can occur in patients who do not have it. Thus, 
clinicians using the representativeness heuristic may overestimate the 
likelihood of a particular disease based on the presence of representative
symptoms and signs, failing to account for its low underlying
prevalence (i.e., the prior, or pretest, probabilities). Conversely, atypical
presentations of common diseases may lead to underestimating the
likelihood of a particular disease. Thus, inexperience with a specific 
disease and with the breadth of its presentations may also lead to diagnostic
delays or errors, e.g., diseases that affect multiple organ systems,
such as sarcoid or tuberculosis, may be particularly challenging to 
diagnose because of the many different patterns they may manifest.
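
The base-rate point above can be made concrete with a short calculation. The sketch below applies Bayes' theorem to the pheochromocytoma example. The prevalence, sensitivity, and specificity values are hypothetical placeholders chosen only to illustrate the effect, not figures taken from this chapter.

```python
def posterior_probability(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease given that the finding is present (Bayes' theorem)."""
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs (assumptions for illustration only).
PRIOR = 0.002        # assumed prevalence of pheochromocytoma among hypertensive patients
SENSITIVITY = 0.90   # assumed P(symptom triad | pheochromocytoma)
SPECIFICITY = 0.94   # assumed P(no triad | no pheochromocytoma)

posterior = posterior_probability(PRIOR, SENSITIVITY, SPECIFICITY)
print(f"Probability of pheochromocytoma given the triad: {posterior:.1%}")
```

Under these assumed inputs, even a highly "representative" presentation leaves the disease improbable because the starting prevalence is so low, which is the error the representativeness heuristic invites.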
A second commonly used cognitive shortcut, the availability heuristic,
involves judgments based on how easily prior similar cases or
outcomes can be brought to mind. For example, a clinician may recall 
a case from a morbidity and mortality conference in which an elderly 
patient presented with painless dyspnea of acute onset and was evalu­
ated for a pulmonary cause but was eventually found to have acute MI, 
with the diagnostic delay likely contributing to the development of 
ischemic cardiomyopathy. If the case was associated with a malpractice 
accusation, such examples may be even more memorable. Errors with 
the availability heuristic arise from several sources of recall bias. Rare 
catastrophic outcomes become memorable cases with a clarity and 
force disproportionate to their likelihood for future diagnosis (for
example, a patient with a sore throat eventually found to have leukemia
or a young athlete with leg pain subsequently found to have an
osteosarcoma), and those publicized in the media or recently experienced
are, of course, easier to recall and therefore more influential on
clinical judgments.
The third commonly used cognitive shortcut, the anchoring heuristic
(also called conservatism or stickiness), involves insufficiently
adjusting the initial probability of disease up (or down) following a
positive (or negative) test when compared with Bayes’ theorem, i.e.,
sticking to the initial diagnosis. For example, a clinician may still judge 
the probability of coronary artery disease (CAD) to be high despite a 
negative exercise perfusion test and go on to cardiac catheterization 
(see “Measures of Disease Probability and Bayes’ Rule,” below).
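
As a rough illustration of anchoring, the sketch below compares the posttest probability that Bayes' theorem yields after a negative test with a clinician estimate that has moved too little from the pretest value. The pretest probability, the test characteristics, and the "anchored" estimate are all hypothetical values assumed for the example.

```python
def bayes_update(pretest: float, sensitivity: float, specificity: float,
                 positive: bool) -> float:
    """Posttest probability of disease after a positive or negative test result."""
    if positive:
        diseased = pretest * sensitivity
        not_diseased = (1.0 - pretest) * (1.0 - specificity)
    else:
        diseased = pretest * (1.0 - sensitivity)
        not_diseased = (1.0 - pretest) * specificity
    return diseased / (diseased + not_diseased)


# Hypothetical inputs (assumptions for illustration only).
PRETEST_CAD = 0.50        # assumed pretest probability of CAD
SENSITIVITY = 0.85        # assumed sensitivity of the stress perfusion test
SPECIFICITY = 0.85        # assumed specificity of the stress perfusion test
ANCHORED_ESTIMATE = 0.40  # assumed clinician estimate after the negative test

bayes_posttest = bayes_update(PRETEST_CAD, SENSITIVITY, SPECIFICITY, positive=False)
shortfall = ANCHORED_ESTIMATE - bayes_posttest

print(f"Bayesian posttest probability: {bayes_posttest:.2f}")
print(f"Anchored estimate exceeds it by: {shortfall:.2f}")
```

The gap between the two numbers is the insufficient adjustment that the anchoring heuristic describes.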
The fourth heuristic states that clinicians should use the simplest 
explanation possible that will adequately account for the patient’s 
symptoms and findings (Occam’s razor or, alternatively, the simplicity
heuristic). Although this is an attractive and often used principle,
it is important to remember that no biologic basis for it exists. Errors
from the simplicity heuristic include premature closure leading to the
neglect of unexplained significant symptoms or findings.

For complex or unfamiliar diagnostic problems, clinicians typically
resort to analytic reasoning processes (System 2) and proceed
methodically using the hypothetico-deductive model of reasoning. Based 
on the patient’s stated reasons for seeking medical attention, clinicians 
develop an initial list of diagnostic possibilities in hypothesis generation. 
During the history of the present illness, the initial hypotheses evolve 
in diagnostic refinement as emerging information is tested against the 
mental models of the diseases being considered with diagnoses increasing
and decreasing in likelihood or even being dropped from or added
to consideration as the working hypotheses of the moment. These 
mental models often generate additional questions that distinguish the 
diagnostic possibilities from one another. The focused physical examination
contributes to further distinguishing the working hypotheses.
Is the spleen enlarged? How big is the liver? Is it tender? Are there any 
palpable masses or nodules? Diagnostic verification involves testing the 
adequacy (whether the diagnosis accounts for all symptoms and signs) 
and coherency (whether the signs and symptoms are consistent with 
the underlying pathophysiologic causal mechanism) of the working 
diagnosis. For example, if the enlarged and quite tender liver felt on 
physical examination is due to acute hepatitis (the hypothesis), then 
certain specific liver function tests will be markedly elevated (the prediction).
Should the tests come back normal, the hypothesis may have
to be discarded and others reconsidered.
Although often neglected, negative findings are as important as 
positive ones because they reduce the likelihood of the diagnostic 
hypotheses under consideration. Chest discomfort that is not provoked 
or worsened by exertion and not relieved by rest in an active patient 
lowers the likelihood that chronic ischemic heart disease is the underlying
cause. The absence of a resting tachycardia and thyroid gland
enlargement reduces the likelihood of hyperthyroidism in a patient 
with paroxysmal atrial fibrillation.
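
One way to see how negative findings lower a diagnostic probability is the odds form of Bayes' rule, in which pretest odds are multiplied by a likelihood ratio for each finding. The sketch below uses the hyperthyroidism example. The pretest probability and both likelihood ratios are hypothetical, and multiplying them in sequence assumes the findings are conditionally independent, which may not hold in practice.

```python
def update_with_likelihood_ratio(probability: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = probability / (1.0 - probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical inputs (assumptions for illustration only).
PRETEST_HYPERTHYROIDISM = 0.10
NEGATIVE_FINDING_LRS = {
    "no resting tachycardia": 0.5,     # assumed likelihood ratio for the absent finding
    "no thyroid gland enlargement": 0.6,
}

probability = PRETEST_HYPERTHYROIDISM
for finding, lr in NEGATIVE_FINDING_LRS.items():
    probability = update_with_likelihood_ratio(probability, lr)
    print(f"After '{finding}': {probability:.3f}")
```

Each likelihood ratio below 1 pulls the probability down, so a run of negative findings can move a hypothesis well below its starting point without any positive result.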
The acuity of a patient’s illness may override considerations of 
prevalence and the other issues described above. “Diagnostic imperatives”
recognize the significance of relatively rare but potentially
catastrophic conditions if undiagnosed and untreated. For example, 
clinicians should consider aortic dissection routinely as a possible 
cause of acute severe chest discomfort. Although the typical presenting
symptoms of dissection differ from those of MI, dissection may
mimic MI, and because it is far less prevalent and potentially fatal if 
mistreated, diagnosing dissection remains a challenging diagnostic 
imperative (Chap. 291). Clinicians taking care of acute, severe chest 
pain patients should explicitly and routinely inquire about symptoms 
suggestive of dissection, measure blood pressures in both arms for discrepancies,
and examine for pulse deficits. When these are all negative,
clinicians may feel sufficiently reassured to discard the aortic dissection
hypothesis. If, however, the chest x-ray shows a possible widened
mediastinum, the hypothesis should be reinstated and an appropriate 
imaging test ordered (e.g., thoracic CT angiography or transesophageal 
echocardiogram). In nonacute situations, the prevalence of potential 
alternative diagnoses should play a much more prominent role in diagnostic
hypothesis generation.
Cognitive scientists studying the thought processes of expert clinicians
have observed that clinicians group data into packets, or “chunks,”
that are stored in short-term or “working memory” and manipulated to 
generate diagnostic hypotheses. Because short-term memory is limited 
(classically humans can accurately repeat a list of 7 ± 2 numbers read 
to them), the number of diagnoses that can be actively considered in 
hypothesis-generating activities is similarly limited. For this reason, the 
cognitive shortcuts discussed above play a key role in the generation 
of diagnostic hypotheses, many of which are discarded as rapidly as 
they are formed, thereby demonstrating that the distinction between 
analytic and intuitive reasoning is an arbitrary and simplistic, but 
nonetheless useful, representation of cognition.
Research into the hypothetico-deductive model of reasoning has 
had difficulty identifying the elements of the reasoning process 
that distinguish experts from novices. This has led to a shift from 
examining the problem-solving process of experts to analyzing the
organization of their knowledge for pattern matching as exemplars,
prototypes, and illness scripts. For example, diagnosis may be based on 
the resemblance of a new case to patients seen previously (exemplars). 
As abstract mental models of disease, prototypes incorporate the likelihood
of various disease features. Illness scripts include risk factors,
pathophysiology, and symptoms and signs. Experts have a much larger 
store of exemplar and prototype cases, an example of which is the visual 
long-term memory of experienced radiologists. However, clinicians do 
not simply rely on literal recall of specific cases but have constructed 
elaborate conceptual networks of memorized information or models of 
disease to aid in arriving at their conclusions (illness scripts). That is, 
expertise involves an enhanced ability to connect symptoms, signs, and 
risk factors to one another in meaningful ways; relate those findings to 
possible diagnoses; and identify the additional information necessary 
to confirm the diagnosis.
No single theory accounts for all the key features of expertise in 
medical diagnosis. Experts have more knowledge about presenting 
symptoms of diseases and a larger repertoire of cognitive tools to 
employ in problem solving than nonexperts. One definition of expertise
highlights the ability to make powerful distinctions. In this sense,
expertise involves a working knowledge of the diagnostic possibilities 
and those features that distinguish one disease from another. Memorization
alone is insufficient, e.g., photographic memory of a medical
textbook would not make one an expert. But having access to detailed 
case-specific relevant information is critically important. In the past, 
clinicians primarily acquired clinical knowledge through their patient 
experiences, but now clinicians have access to a plethora of information 
sources. Clinicians of the future will be able to leverage the experiences 
of large numbers of other clinicians using electronic tools, but, as with 
the memorized textbook, the data alone will be insufficient to create 
expertise.
Despite all the research seeking to understand expertise in medicine
and other disciplines, it remains uncertain whether any didactic
program can actually accelerate the progression from novice to expert 
or from experienced clinician to master clinician. Deliberate effortful 
practice (over an extended period of time, sometimes said to be 10 years 
or 10,000 practice hours) and personal coaching are two strategies 
often used outside medicine (e.g., music, athletics, chess) to develop 
expertise. Their use in the context of medical practice has not yet been 
adequately explored. Some studies in medicine suggest that the most 
beneficial approach to education exposes students to both the signs 
and symptoms of specific diseases (disease pattern recognition) and, in 
addition, the lists of diseases that can present with specific symptoms 
and signs (differential diagnosis). Active learning opportunities useful 
for those in training include developing a personal learning system, 
e.g., systematically reflecting on diagnostic processes used (metacognition)
and following up to identify diagnoses and treatments for patients
in their care.

### PERSONALIZED DECISION-MAKING

The modern ideal of medical therapeutic decision-making is to “personalize”
treatment recommendations. In the abstract, personalizing
treatment involves combining the best available evidence about what 
works with an individual patient’s unique features (e.g., risk factors, 
genomics, and comorbidities) and their preferences and health goals to 
craft an optimal treatment recommendation with the patient. Operationally,
two different and complementary levels of personalization are
possible: individualizing the risk of harm and benefit for the options 
being considered based on the specific patient characteristics (precision
medicine) and personalizing the therapeutic decision process
by incorporating the patient’s preferences and values for the possible 
health outcomes. This latter process is sometimes referred to as shared 
decision-making and typically involves clinicians sharing their knowledge
about the options and the associated consequences and trade-offs
and patients sharing their health goals (e.g., avoiding a short-term risk 
of dying from coronary artery bypass grafting to see their grandchild 
get married in a few months).
Individualizing the evidence about therapy does not mean relying 
on physician impressions of benefit and harm from their personal
experience. Because of nonrandom selection, small sample sizes, and
rare events, the chance of drawing erroneous causal inferences from 
one’s own clinical experience is very high. For most chronic diseases, 
the treatment response is a counterfactual concept, only demonstrable 
statistically in large patient populations. Because of this, it would be 
incorrect to infer with any certainty, for example, that treating a hypertensive
patient with angiotensin-converting enzyme (ACE) inhibitors
necessarily prevented a stroke from occurring during treatment, or that 
an untreated patient would definitely have avoided their stroke had 
they been treated. For many chronic diseases, a majority of patients 
will remain event free over long periods of time regardless of treatment 
choices; some will have events regardless of which treatment is selected; 
and those who avoided having an event through treatment cannot be 
individually identified. Blood pressure lowering, a readily observable 
surrogate endpoint, does not have a tightly coupled relationship with 
strokes prevented. Consequently, in most situations, demonstrating 
therapeutic effectiveness cannot rely simply on observing the outcome 
of an individual patient but should instead be based on large groups of 
patients carefully studied and properly analyzed.
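
The counterfactual argument above can be sketched with simple arithmetic on a hypothetical treated population. The cohort size, untreated event risk, and relative risk reduction below are invented for illustration, and the sketch assumes that treatment only prevents events and never causes them.

```python
def outcome_breakdown(n_patients: int, untreated_risk: float,
                      relative_risk_reduction: float) -> dict[str, float]:
    """Expected counts in a treated cohort, split by counterfactual outcome."""
    events_if_untreated = n_patients * untreated_risk
    events_if_treated = events_if_untreated * (1.0 - relative_risk_reduction)
    return {
        "event free regardless of treatment": n_patients - events_if_untreated,
        "event despite treatment": events_if_treated,
        "event prevented by treatment": events_if_untreated - events_if_treated,
    }


# Hypothetical inputs (assumptions for illustration only).
breakdown = outcome_breakdown(n_patients=1000, untreated_risk=0.10,
                              relative_risk_reduction=0.30)
for group, count in breakdown.items():
    print(f"{group}: {count:.0f}")
```

In practice the first and third groups look identical at the bedside, since both are treated patients who had no event, so the size of the third group can only be estimated by comparing event rates between groups.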
Therapeutic decision-making, therefore, should be based on the best 
available evidence from clinical trials and well-done outcome studies. 
Trustworthy clinical practice guidelines that synthesize such evidence 
offer normative guidance for many testing and treatment decisions. 
However, all guidelines recognize that “one size fits all” recommendations
may not apply to individual patients. Increased research into the
heterogeneity of treatment effects seeks to understand how best to 
adjust group-level clinical evidence of treatment harms and benefits to 
account for the absolute level of risks faced by subgroups and even by 
individual patients, using, for example, validated clinical risk scores.
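
A minimal sketch of why absolute risk matters when adjusting group-level evidence: if a trial's relative risk reduction is assumed to apply equally across risk strata (a simplifying assumption that heterogeneity-of-treatment-effect research specifically questions), the absolute benefit still varies widely with baseline risk. All numbers below are hypothetical.

```python
def absolute_benefit(baseline_risk: float,
                     relative_risk_reduction: float) -> tuple[float, float]:
    """Return (absolute risk reduction, number needed to treat)."""
    arr = baseline_risk * relative_risk_reduction
    return arr, 1.0 / arr


# Hypothetical inputs (assumptions for illustration only).
RELATIVE_RISK_REDUCTION = 0.25
BASELINE_RISKS = {"low": 0.02, "intermediate": 0.10, "high": 0.30}

results = {
    label: absolute_benefit(risk, RELATIVE_RISK_REDUCTION)
    for label, risk in BASELINE_RISKS.items()
}
for label, (arr, nnt) in results.items():
    print(f"{label} risk: ARR = {arr:.3f}, NNT = {nnt:.1f}")
```

A validated clinical risk score would supply the baseline risk for an individual patient. The point of the sketch is only that the same relative effect translates into very different numbers needed to treat.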

### NONCLINICAL INFLUENCES ON CLINICAL DECISION-MAKING

More than three decades of research on variations in clinician practice
patterns have identified important nonclinical forces that shape
clinical decisions. These factors can be grouped conceptually into three 
overlapping categories: (1) factors related to an individual physician’s 
practice, (2) factors related to practice setting, and (3) factors related 
to payment systems.

#### Practice Style

To ensure that necessary care is provided at a high
level of quality, physicians fulfill a key role in medical care by serving
as the patient’s advocate. Factors that influence performance in
this role include the physician’s knowledge, training, and experience. 
Clearly, physicians cannot practice EBM if they are unfamiliar 
with the evidence. As would be expected, specialists generally know 
the evidence in their field better than do generalists. Beyond published 
evidence and practice guidelines, a major set of influences on physician
practice can be subsumed under the general concept of “practice
style.” The practice style serves to define norms of clinical behavior. 
Differing practice styles may be based on training, personal experience,
and medical evidence. Beliefs about effectiveness of different
therapies and preferred patterns of diagnostic test use are examples of 
different facets of a practice style. For example, cardiologists evaluating 
patients with lower risk chest pain symptoms often conceptualize their 
primary diagnostic objective as maximizing the detection of ischemia. 
For this reason, they may strongly favor stress imaging. Internists caring
for the same patients may be more comfortable with initial use of
exercise electrocardiogram (ECG) testing without imaging. This latter 
practice style focuses less on ischemia detection and more on following
guideline recommendations that indicate no outcome advantage
for stress imaging in this context. Cardiologists, relative to general 
internists, may also favor a more liberal use of coronary angiography 
and revascularization in patients with stable ischemic symptoms, i.e., 
the “oculostenotic reflex.”
Beyond the patient’s welfare, physician perceptions about the risk 
of a malpractice suit resulting from either an erroneous decision or a 
bad outcome may drive clinical decisions and create a practice referred 
to as defensive medicine. This practice involves ordering tests and 
therapies with very small marginal benefits, ostensibly to preclude
future criticism should an adverse outcome occur. Over time, such
patterns of care may become accepted as part of the practice norm, 
thereby perpetuating their overuse, e.g., annual cardiac exercise testing 
in asymptomatic patients.

Practice Setting 
Factors in this category relate to work systems 
including tasks and workflow (e.g., interruptions, inefficiencies, workload), 
technology (e.g., electronic health record design or implementation issues), 
organizational characteristics (e.g., culture, leadership, 
staffing, scheduling), and the physical environment (e.g., noise, light­
ing, layout). Physician-induced demand is a term that refers to the 
repeated observation that once medical facilities and technologies 
become available to physicians, they will find ways to use them. Other 
environmental factors that can influence decision-making include 
the local availability of specialists for consultations and procedures; 
“high-tech” advanced imaging or procedure facilities such as magnetic 
resonance imaging (MRI) machines and proton beam therapy centers; 
and fragmentation of care.
Payment Systems 
Economic incentives are closely related to the 
other two categories of practice-modifying factors. Financial issues can 
exert both stimulatory and inhibitory influences on clinical practice. 
Historically, physicians have been paid on a fee-for-service, capitation, 
or salary basis. In fee-for-service, physicians who do more generally get 
paid more, thereby encouraging overuse, consciously or unconsciously. 
When fees are reduced (discounted reimbursement), clinicians tend 
to increase the number of services provided to maintain revenue. 
Capitation, in contrast, provides a fixed payment per patient per year 
to encourage physicians to consider a global population budget in 
managing individual patients and ideally reducing the use of interven­
tions with small marginal benefit. In recognition of the unsustainability 
of continued growth in medical expenditures and the opportunity 
costs associated with that (funds that might be more beneficially 
applied to education, energy, social welfare, or defense), current efforts 
seek to transition to a value-based payment system to reduce overuse 
and to reflect benefit. Work to define how to tie payment to value has 
mostly focused so far on “pay for performance” models. High-quality 
clinical trial evidence for the effectiveness of these models is still mostly 
lacking.
■
■DIAGNOSTIC TEST PERFORMANCE: 
UNDERSTANDING TEST ACCURACY
The purpose of performing a test on a patient is to reduce uncertainty 
about the patient’s diagnosis or prognosis to facilitate appropriate 
management. Although diagnostic tests commonly refer to laboratory 
(e.g., blood count) or imaging tests or procedures (e.g., colonoscopy or 
bronchoscopy), any information that changes a clinician’s understand­
ing of the patient’s problem qualifies as a diagnostic test. Thus, even the 
history and physical examination can be considered as diagnostic tests. 
In clinical medicine, it is common to reduce the results of a test to a 
dichotomous outcome, such as positive or negative, normal or abnor­
mal. Although this simplification often suppresses useful information 
(such as the degree of abnormality), it facilitates illustrating some 
important principles of test interpretation that are described below.
The accuracy of any diagnostic test is best assessed relative to a “gold 
standard,” where a positive gold standard test defines the patients who 
have disease and a negative test securely rules out disease (Table 4-1). 
Characterizing the diagnostic performance of a new test requires 
identifying an appropriate population (ideally, patients representative 
of those in whom the new test would be used) and applying both the 
new and the gold standard tests to all subjects. Biased estimates of 
test performance occur when diagnostic accuracy is defined using an 
inappropriate population or one in which gold standard determina­
tion of disease status is incomplete. The accuracy of the new test in 
distinguishing disease from health is determined relative to the gold 
standard results and summarized in four estimates. The sensitivity or 
true-positive rate reflects how well the new test identifies patients with 
disease. It is the proportion of patients with disease (defined by the 
gold standard) who have a positive test. The proportion of patients with 
disease who have a negative test is the false-negative rate, calculated as 1 – sensitivity.

TABLE 4-1  Measures of Diagnostic Test Accuracy

| TEST RESULT | DISEASE PRESENT | DISEASE ABSENT |
|---|---|---|
| Positive | True positives (TP) | False positives (FP) |
| Negative | False negatives (FN) | True negatives (TN) |

Test Characteristics in Patients with Disease
True-positive rate (sensitivity) = TP/(TP + FN)
False-negative rate = FN/(TP + FN) = 1 – true-positive rate

Test Characteristics in Patients without Disease
True-negative rate (specificity) = TN/(TN + FP)
False-positive rate = FP/(TN + FP) = 1 – true-negative rate

The specificity, or true-negative rate, reflects how well 
the new test correctly identifies patients without disease. It is the pro­
portion of patients without disease (defined by the gold standard) who 
have a negative test. The proportion of patients without disease who 
have a positive test is the false-positive rate, calculated as 1 – specificity. 
A theoretically perfect test then would have a sensitivity of 100% and 
a specificity of 100% and would completely distinguish patients with 
disease from those without it. A useful mnemonic to help remember 
the somewhat paradoxical relationship between what the test is best at 
technically versus what it is most useful for clinically is: a test with a 
very high sensitivity (Sn) when negative (N) helps rule out (out) disease 
(SnNout), and a test with a very high specificity (Sp) when positive (P) 
helps rule in (in) disease (SpPin).
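The four rates defined in Table 4-1 can be computed directly from the counts in a 2 × 2 table. The following is a minimal sketch; the class name `TwoByTwo` and the study counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from comparing a new test with the gold standard (layout of Table 4-1)."""

    tp: int  # test positive, disease present
    fp: int  # test positive, disease absent
    fn: int  # test negative, disease present
    tn: int  # test negative, disease absent

    def sensitivity(self) -> float:
        """True-positive rate: TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """True-negative rate: TN / (TN + FP)."""
        return self.tn / (self.tn + self.fp)

    def false_negative_rate(self) -> float:
        """FN / (TP + FN), which equals 1 - sensitivity."""
        return self.fn / (self.tp + self.fn)

    def false_positive_rate(self) -> float:
        """FP / (TN + FP), which equals 1 - specificity."""
        return self.fp / (self.tn + self.fp)


# Hypothetical study: 100 patients with disease and 200 without.
study = TwoByTwo(tp=90, fp=20, fn=10, tn=180)
print(f"sensitivity         = {study.sensitivity():.2f}")
print(f"specificity         = {study.specificity():.2f}")
print(f"false-negative rate = {study.false_negative_rate():.2f}")
print(f"false-positive rate = {study.false_positive_rate():.2f}")
```

With these illustrative counts, both sensitivity and specificity are 0.90, matching the hypothetical test used in the examples later in this section.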
Calculating sensitivity and specificity requires selection of a thresh­
old value or cut point above which the test is considered “positive.” 
Making the cut point “stricter” (e.g., raising it) lowers sensitivity but 
improves specificity, while making it “laxer” (e.g., lowering it) raises 
sensitivity but lowers specificity. This dynamic trade-off between more 
accurate identification of patients with disease versus those without 
disease is often displayed graphically as a receiver operating charac­
teristic (ROC) curve (Fig. 4-1) by plotting sensitivity (y axis) versus 
1 – specificity (x axis). Each point on the curve represents a potential 
cut point with an associated sensitivity and specificity value. The area 
under the ROC curve often is used as a quantitative measure of the 
information content of a test. Values range from 0.5 (no diagnostic 
information from testing at all; the test is equivalent to flipping a coin) 
to 1.0 (perfect test). The choice of cut point should ideally reflect the 
relative harms and benefits of treatment for those without versus those 
with disease. For example, if treatment was safe with substantial ben­
efit, then choosing a high-sensitivity cut point (upper right of the ROC 
curve) for a low-risk test may be appropriate (e.g., phenylketonuria in 
newborns), but if treatment had substantial risk for harm, then choos­
ing a high-specificity cut point (lower left of the ROC curve) may be 
appropriate (e.g., chemotherapy for cancer). The choice of cut point 
may also depend on the prevalence of disease, with low prevalence 
placing a greater emphasis on the harms of false-positive tests (e.g., 
HIV testing in marriage applicants) or the harms of false-negative tests 
(e.g., HIV testing in blood donors).
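The trade-off described above can be made concrete by sweeping the cut point across a set of test values and recording the sensitivity and false-positive rate at each one. The sketch below assumes that higher test values indicate disease and that a result is "positive" when it is at or above the cut point; the test values are hypothetical, and the area is estimated with the trapezoidal rule.

```python
def roc_points(diseased, healthy):
    """Return (false-positive rate, sensitivity) pairs, one per candidate cut point.

    A result counts as "positive" when its value is >= the cut point.
    Cut points are visited from strictest (highest) to laxest (lowest),
    so both rates rise monotonically along the returned list.
    """
    cut_points = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]  # cut point above every value: no result is positive
    for cut in cut_points:
        sensitivity = sum(v >= cut for v in diseased) / len(diseased)
        false_positive_rate = sum(v >= cut for v in healthy) / len(healthy)
        points.append((false_positive_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under ROC points ordered by false-positive rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical test values in patients with and without disease.
diseased = [4, 6, 7, 8, 9]
healthy = [1, 2, 3, 5, 6]

curve = roc_points(diseased, healthy)
for false_positive_rate, sensitivity in curve:
    print(f"1 - specificity = {false_positive_rate:.1f}, sensitivity = {sensitivity:.1f}")
print(f"area under ROC curve = {area_under_curve(curve):.2f}")
```

For these illustrative data the area is 0.90. Raising the cut point moves along the curve toward the lower left (higher specificity), and lowering it moves toward the upper right (higher sensitivity).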
■
■MEASURES OF DISEASE PROBABILITY AND 
BAYES’ RULE
In the absence of perfect tests, the true disease state of the patient 
remains uncertain after every test. Bayes’ rule provides a way to quan­
tify the revised uncertainty using simple probability mathematics (and 
thereby avoid anchoring bias). It calculates the posttest probability, 
or likelihood of disease after a test result, from three parameters: the 
pretest probability of disease, the test sensitivity, and the test specificity. 
The pretest probability is a quantitative estimate of the likelihood of the 
diagnosis before the test is performed and is usually estimated from 
the prevalence of the disease in the underlying population (if known) 
or clinical context (e.g., age, sex, and type of chest pain). For some 
common conditions, such as CAD, existing nomograms and statistical 
models generate estimates of pretest probability that account for history, physical examination, and test findings.

FIGURE 4-1  Receiver operating characteristic (ROC) curves, plotting true-positive rate (y axis) against false-positive rate (x axis) for tests labeled “Good,” “Fair,” and “No predictive value.” Each ROC curve illustrates a tradeoff that occurs between improved test sensitivity (accurate detection of patients with disease) and improved test specificity (accurate detection of patients without disease), as the test value defining when the test turns from “negative” to “positive” is varied. A 45° line would indicate a test with no predictive value (sensitivity = 1 – specificity at every test value). The area under each ROC curve is a measure of the information content of the test. Thus, a larger ROC area signifies increased diagnostic accuracy.

The posttest probability (also called the predictive value of the test, see below) is a recalibrated 
statement of the probability of the diagnosis, accounting for both pre­
test probability and test results. For the probability of disease following 
a positive test (i.e., positive predictive value), Bayes’ rule is calculated as:
$$
\text{Posttest probability} = \frac{\text{Pretest probability} \times \text{test sensitivity}}{\text{Pretest probability} \times \text{test sensitivity} + (1 - \text{Pretest probability}) \times (\text{false-positive test rate})}
$$
For example, consider a 64-year-old woman with atypical chest pain 
who has a pretest probability of 0.50 and a “positive” diagnostic test 
result (assuming test sensitivity = 0.90 and specificity = 0.90).
$$
\text{Posttest probability} = \frac{(0.50)(0.90)}{(0.50)(0.90) + (0.50)(0.10)} = 0.90
$$
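The same calculation can be written as a short function. This is a minimal sketch: the function names are illustrative, and the version for a negative test is the standard analogue of the formula above, which the text does not write out.

```python
def posttest_probability_positive(pretest: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease after a positive test (Bayes' rule as given in the text)."""
    true_positive = pretest * sensitivity
    false_positive = (1.0 - pretest) * (1.0 - specificity)  # false-positive rate = 1 - specificity
    return true_positive / (true_positive + false_positive)


def posttest_probability_negative(pretest: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease after a negative test (analogous form, added for illustration)."""
    false_negative = pretest * (1.0 - sensitivity)
    true_negative = (1.0 - pretest) * specificity
    return false_negative / (false_negative + true_negative)


# The example from the text: pretest probability 0.50, sensitivity 0.90, specificity 0.90.
after_positive = posttest_probability_positive(0.50, 0.90, 0.90)
after_negative = posttest_probability_negative(0.50, 0.90, 0.90)
print(f"after a positive test: {after_positive:.2f}")
print(f"after a negative test: {after_negative:.2f}")
```

For the patient in the example, a positive result raises the probability of disease from 0.50 to 0.90, and a negative result would lower it to 0.10.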
The term predictive value has often been used as a synonym for the 
posttest probability. Unfortunately, clinicians commonly misinterpret 
reported predictive values as intrinsic measures of test accuracy rather 
than calculated probabilities. Studies of diagnostic test performance 
compound the confusion by calculating predictive values from the 
same sample used to measure sensitivity and specificity. Such calcula­
tions are misleading unless the test is applied subsequently to popula­
tions with exactly the same disease prevalence. For these reasons, the 
more descriptive term, posttest probability following a positive or a 
negative test, is preferred over predictive value.
The nomogram version of Bayes’ rule (Fig. 4-2) helps us to under­
stand at a conceptual level how it estimates the posttest probability of 
disease. In this nomogram, the impact of the diagnostic test result is 
summarized by the likelihood ratio, which is defined as the ratio of 
the probability of a given test result (e.g., “positive” or “negative”) in a patient with disease to the probability of that result in a patient without disease, thereby providing a measure of how well the test distinguishes those with from those without disease.

FIGURE 4-2  Nomogram version of Bayes’ theorem used to predict the posttest probability of disease (right-hand scale) using the pretest probability of disease (left-hand scale) and the likelihood ratio for a positive or a negative test (middle scale). See text for information on calculation of likelihood ratios. To use, place a straightedge connecting the pretest probability and the likelihood ratio and read off the posttest probability. The right-hand part of the figure illustrates the value of a positive exercise treadmill test (likelihood ratio 4, green line) and a positive exercise thallium single-photon emission CT perfusion study (likelihood ratio 9, broken brown line) in a patient with a pretest probability of coronary artery disease of 50%. (Adapted from Centre for Evidence-Based Medicine: Likelihood ratios. Available at http://www.cebm.net/likelihood-ratios/.)
The likelihood ratio for a positive test is calculated as the ratio of the 
true-positive rate to the false-positive rate (or sensitivity/[1 – specificity]). 
For example, a test with a sensitivity of 0.90 and a specificity of 0.90 
has a likelihood ratio of 0.90/(1 – 0.90), or 9. Thus, for this hypotheti­
cal test, a “positive” result is 9 times more likely in a patient with the 
disease than in a patient without it. Most tests in medicine have likeli­
hood ratios for a positive result between 1.5 and 20. Higher values 
are associated with tests that more substantially increase the posttest 
likelihood of disease. A very high likelihood ratio positive (>10) usually 
implies high specificity, so a positive high-specificity test helps “rule 
in” disease (the “SpPin” mnemonic introduced earlier). If sensitivity is 
excellent but specificity is less so, the likelihood ratio positive will be 
reduced substantially (e.g., with a 90% sensitivity but a 55% specificity, 
the likelihood ratio positive is 2.0).
The corresponding likelihood ratio for a negative test is the ratio of 
the false-negative rate to the true-negative rate (or [1 – sensitivity]/
specificity). Lower likelihood ratio negative values more substan­
tially lower the posttest likelihood of disease. A very low likelihood 
ratio negative (falling below 0.10) usually implies high sensitivity, so 
a negative high-sensitivity test helps “rule out” disease (the SnNout 
mnemonic). The hypothetical test considered above with a sensitivity 
of 0.9 and a specificity of 0.9 would have a likelihood ratio for a negative 
test result of (1 – 0.9)/0.9, or 0.11, meaning that a negative result 
is about one-tenth as likely in patients with disease as in those without 
disease (or about 10 times more likely in those without disease than in 
those with disease).
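The likelihood ratios above, and the straightedge operation of the nomogram, can be sketched in a few lines. The nomogram is equivalent to the odds form of Bayes' rule (posttest odds = pretest odds × likelihood ratio); the function names here are illustrative.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (likelihood ratio positive, likelihood ratio negative)."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def apply_likelihood_ratio(pretest: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' rule: convert to odds, multiply by the LR, convert back."""
    pretest_odds = pretest / (1.0 - pretest)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# The hypothetical test from the text: sensitivity 0.90, specificity 0.90.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.90)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")

# Excellent sensitivity but modest specificity (0.90 and 0.55).
lr_pos_weak, _ = likelihood_ratios(0.90, 0.55)
print(f"LR+ with 55% specificity = {lr_pos_weak:.1f}")

# Reading the nomogram for a pretest probability of 50% and a positive result.
print(f"posttest probability = {apply_likelihood_ratio(0.50, lr_pos):.2f}")
```

This reproduces the values quoted in the text: a likelihood ratio positive of 9, a likelihood ratio negative of 0.11, and a likelihood ratio positive of 2.0 when specificity falls to 55%.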


■
■APPLICATIONS TO DIAGNOSTIC TESTING IN CAD
Consider two tests commonly used 
in the diagnosis of CAD: an exercise 
treadmill test and an exercise single-photon 
emission CT (SPECT) myocardial per­
fusion imaging test (Chap. 248). A 
positive treadmill ST-segment response 
has an average sensitivity of ~60% and 
an average specificity of ~75%, yielding 
a likelihood ratio positive of 2.4 (0.60/
[1 – 0.75]) (consistent with modest 
discriminatory ability because it falls 
between 2 and 5). A 41-year-old man 
with atypical chest pain and no other 
risk factors has about a 10% pretest 
probability of CAD. After a positive 
result, the posttest probability of disease 
rises to only ~30%. For a 60-year-old 
man with typical angina and multiple 
risk factors, the pretest probability of 
CAD is about 80%. After a positive test 
result, the posttest probability of disease 
rises to ~95%.

In contrast, the exercise SPECT myocardial perfusion test is more accurate 
for diagnosis of CAD. For simplicity, 
assume that the finding of a revers­
ible exercise-induced perfusion defect 
has both a sensitivity and a specificity 
of 90% (a bit higher than reported), 
yielding a likelihood ratio for a positive 
test of 9.0 (0.90/[1 – 0.90]) (consistent 
with intermediate discriminatory abil­
ity because it falls between 5 and 10). 
For the same 10% pretest probability 
patient, a positive test raises the probability of CAD to 50% (Fig. 4-2). 
However, despite the differences in posttest probabilities between these 
two tests (30 vs 50%), the more accurate test may not improve diag­
nostic likelihood enough to change patient management (e.g., decision 
to refer to cardiac catheterization) because the more accurate test has 
only moved the physician from being fairly certain that the patient did 
not have CAD to a 50:50 chance of disease. In a patient with a pretest 
probability of 80%, the exercise SPECT test raises the posttest probability 
to 97% (compared with 95% for the exercise treadmill). Again, the 
more accurate test does not provide enough improvement in posttest 
confidence to alter management, and neither test has improved much 
on what was known from clinical data alone.
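The posttest probabilities in this comparison can be recomputed with the odds form of Bayes' rule. The sketch below is illustrative only. It evaluates the treadmill test at two likelihood ratios because the posttest values quoted above (~30% and ~95%) correspond to a likelihood ratio of about 4, the value drawn in Fig. 4-2; applying the likelihood ratio of 2.4 derived from a sensitivity of 60% and a specificity of 75% gives somewhat lower values.

```python
def posttest_from_lr(pretest: float, likelihood_ratio: float) -> float:
    """Posttest probability from pretest probability and a likelihood ratio."""
    odds = pretest / (1.0 - pretest) * likelihood_ratio
    return odds / (1.0 + odds)


# Likelihood ratios for a positive result.
TESTS = {
    "treadmill, LR 2.4 (0.60/[1 - 0.75])": 2.4,
    "treadmill, LR 4 (as drawn in Fig. 4-2)": 4.0,
    "SPECT perfusion, LR 9": 9.0,
}

for pretest in (0.10, 0.80):
    for label, lr in TESTS.items():
        posttest = posttest_from_lr(pretest, lr)
        print(f"pretest {pretest:.0%} | {label:<40} | posttest {posttest:.0%}")
```

Under either treadmill likelihood ratio the clinical point is unchanged: at a pretest probability of 10%, neither test moves the probability high enough to rule in disease, and at 80% both tests add little to what was already known.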
In general, positive results with an accurate test (e.g., likelihood 
ratio for a positive test of 10) when the pretest probability is low (e.g., 
20%) do not move the posttest probability to a range high enough to 
rule in disease (e.g., 80%). In screening situations, pretest probabilities 
are often particularly low because patients are asymptomatic. In such 
cases, specificity becomes especially important. For example, in screen­
ing first-time female blood donors without risk factors for HIV, a posi­
tive test raised the likelihood of HIV to only 67% despite a specificity 
of 99.995% because the prevalence was 0.01%. Conversely, with a high 
pretest probability, a negative test may not rule out disease adequately 
if it is not sufficiently sensitive. Thus, the largest change in diagnostic 
likelihood following a test result occurs when the clinician is most 
uncertain (i.e., pretest probability between 30 and 70%). For example, 
a 70-year-old woman with typical angina and multiple risk factors has a

pretest probability for CAD of ~50%. A positive exercise treadmill test 
moves the posttest probability to 80%, and a positive exercise SPECT 
perfusion test moves it to 90% (Fig. 4-2).
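The blood-donor screening example can be checked with Bayes' rule. This is a sketch under one stated assumption: the text gives the prevalence and specificity but not the sensitivity, so a sensitivity of 100% is assumed here for illustration.

```python
def positive_predictive_value(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease after a positive test (Bayes' rule)."""
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive / (true_positive + false_positive)


prevalence = 0.0001     # 0.01%, from the text
specificity = 0.99995   # 99.995%, from the text
sensitivity = 1.0       # ASSUMPTION: not stated in the text

ppv = positive_predictive_value(prevalence, sensitivity, specificity)
print(f"probability of HIV after a positive screen: {ppv:.0%}")

# The same test applied at higher (hypothetical) pretest probabilities.
for other_prevalence in (0.01, 0.10):
    value = positive_predictive_value(other_prevalence, sensitivity, specificity)
    print(f"prevalence {other_prevalence:.0%}: posttest probability {value:.1%}")
```

Per 1,000,000 donors this corresponds to about 100 true positives and about 50 false positives, which is why the posttest probability is only about 67% despite the very high specificity.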

As presented above, Bayes’ rule employs a number of important 
simplifications that should be considered. First, few tests provide only 
“positive” or “negative” results. Many tests have multidimensional out­
comes (e.g., extent of ST-segment depression, exercise duration, and 
exercise-induced symptoms with exercise testing). Although Bayes’ 
theorem can be adapted to this more detailed test result format, it 
is computationally more complex to do so. Similarly, when multiple 
sequential tests are performed, the posttest probability may be used 
as the pretest probability to interpret the second test. However, this 
simplification assumes conditional independence—that is, that the 
results of the first test do not affect the likelihood of the second test 
result—and this is often not true.
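Sequential testing under the conditional independence assumption can be sketched by feeding each posttest probability back in as the next pretest probability. The example below is illustrative: it uses the likelihood ratios shown in Fig. 4-2 (4 for a positive treadmill test, 9 for a positive SPECT study) and a 10% starting probability, and it will overstate the final probability whenever the two test results are correlated.

```python
def update(pretest: float, likelihood_ratio: float) -> float:
    """One Bayesian update in odds form."""
    odds = pretest / (1.0 - pretest) * likelihood_ratio
    return odds / (1.0 + odds)


def sequential_posttest(pretest: float, likelihood_ratios: list[float]) -> float:
    """Chain updates, ASSUMING each test result is conditionally independent of the others."""
    probability = pretest
    for lr in likelihood_ratios:
        probability = update(probability, lr)
    return probability


pretest = 0.10
after_treadmill = update(pretest, 4.0)
after_both = sequential_posttest(pretest, [4.0, 9.0])
print(f"after positive treadmill test: {after_treadmill:.0%}")
print(f"after positive treadmill and positive SPECT: {after_both:.0%}")
```

Because the updates multiply, chaining the two tests is equivalent to applying a single combined likelihood ratio of 4 × 9 = 36, which is only valid if the independence assumption holds.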
Finally, many texts assert that sensitivity and specificity are 
prevalence-independent parameters of test accuracy. This statistically 
useful assumption, however, is often incorrect. A treadmill exercise 
test, for example, has a sensitivity of ~30% in a population of patients 
with one-vessel CAD, whereas its sensitivity in patients with severe 
three-vessel CAD approaches 80%. Thus, the best estimate of sensitiv­
ity to use in a particular decision may vary, depending on the severity 
of disease in the local population. A hospitalized, symptomatic, or 
referral population typically has a higher prevalence of disease and, 
importantly, a higher prevalence of more advanced disease than does 
an outpatient population. Consequently, test sensitivity will likely be 
higher in hospitalized patients and test specificity higher in outpatients.
■
■RISK PREDICTION MODELS
Bayes’ rule, when used as presented above, is useful in studying diag­
nostic testing concepts, but predictions based on multivariable statisti­
cal models can more accurately address these more complex problems 
by simultaneously accounting for additional relevant patient charac­
teristics. These models explicitly account for multiple, even possibly 
overlapping, pieces of patient-specific information and assign a relative 
weight to each on the basis of its unique independent contribution to 
the prediction in question. For example, a logistic regression model to 
predict the probability of CAD ideally considers all the relevant inde­
pendent factors from the clinical examination and diagnostic testing 
and their relative importance instead of the limited data that clinicians 
can manage in their heads or with Bayes’ rule. However, despite this 
strength, prediction models are usually too complex computationally 
to use without a calculator or computer. Guideline-driven treatment 
recommendations based on statistical prediction models available 
online, e.g., the American College of Cardiology/American Heart 
Association risk calculator for primary prevention with statins and 
the CHA2DS2-VASc calculator for anticoagulation in atrial fibrillation, 
have generated more widespread usage. Some predictive models 
are now embedded into electronic health record (EHR) systems, most 
commonly addressing issues related to thrombosis/anticoagulation 
and to sepsis. Evidence about the impact of these EHR-based models 
on patient outcomes is mostly observational and suggests more work is 
needed to deliver the risk information to the right clinician at the right 
time in a way that supports clinical workflow.
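To illustrate how a multivariable model assigns a relative weight to each piece of patient information, the sketch below evaluates a logistic regression equation. Every coefficient, the intercept, and the predictor names are hypothetical values invented for illustration; this is not a validated model and must not be used for clinical prediction.

```python
import math

# HYPOTHETICAL intercept and weights, chosen only to show the mechanics.
INTERCEPT = -3.0
WEIGHTS = {
    "decades_over_40": 0.4,       # age, in decades above 40 years
    "male_sex": 0.7,              # 1 if male, else 0
    "typical_angina": 1.6,        # 1 if present, else 0
    "diabetes": 0.5,              # 1 if present, else 0
    "positive_stress_test": 1.4,  # 1 if positive, else 0
}


def predicted_probability(patient: dict[str, float]) -> float:
    """Logistic model: probability = 1 / (1 + exp(-(intercept + sum of weight * value)))."""
    log_odds = INTERCEPT + sum(
        weight * patient.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


lower_risk = {"decades_over_40": 0.1, "male_sex": 1}
higher_risk = {
    "decades_over_40": 2.0,
    "male_sex": 1,
    "typical_angina": 1,
    "diabetes": 1,
    "positive_stress_test": 1,
}

print(f"lower-risk patient:  {predicted_probability(lower_risk):.2f}")
print(f"higher-risk patient: {predicted_probability(higher_risk):.2f}")
```

In a real model the weights are estimated from data and, as the text emphasizes, require validation in an independent population before use.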
One reason for limited clinical use is that, to date, only a handful of 
prediction models have been validated sufficiently (e.g., Wells criteria 
for pulmonary embolism; Table 4-2). The importance of independent 
validation in a population separate from the one used to develop the 
model cannot be overstated. An unvalidated risk prediction model 
should be viewed with the skepticism appropriate for any new drug or 
medical device that has not had rigorous clinical trial testing.
When statistical survival models in cancer and heart disease have 
been compared directly with clinicians’ predictions, the survival mod­
els have been found to be more consistent, as would be expected, but 
not always more accurate. On the other hand, comparison of clinicians 
with websites and apps that generate lists of possible diagnoses to 
help patients with self-diagnosis found that physicians outperformed 
the currently available programs. For students and less-experienced 
clinicians, the biggest value of diagnostic decision support may be in extending diagnostic possibilities and triggering “rational override,” but their impact on knowledge, information-seeking, and problem-solving needs additional research.

TABLE 4-2  Wells Clinical Prediction Rule for Pulmonary Embolism (PE)

| CLINICAL FEATURE | POINTS |
|---|---|
| Clinical signs of deep-vein thrombosis | 3 |
| Alternative diagnosis is less likely than PE | 3 |
| Heart rate >100 beats/min | 1.5 |
| Immobilization ≥3 days or surgery in previous 4 weeks | 1.5 |
| History of deep-vein thrombosis or PE | 1.5 |
| Hemoptysis | 1 |
| Malignancy (with treatment within 6 months) or palliative | 1 |

| INTERPRETATION | |
|---|---|
| Score >6.0 | High |
| Score 2.0–6.0 | Intermediate |
| Score <2.0 | Low |
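A point-based rule such as the Wells score in Table 4-2 is simple enough to compute by hand, but a short sketch shows how such a rule might be encoded. The point values used are those of the published Wells rule (3 for clinical signs of deep-vein thrombosis and for PE being the most likely diagnosis, 1.5 for each of the next three features, and 1 for hemoptysis and for malignancy); the dictionary keys and function names are illustrative.

```python
WELLS_POINTS = {
    "clinical_signs_of_dvt": 3.0,
    "alternative_diagnosis_less_likely_than_pe": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "history_of_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "malignancy": 1.0,
}


def wells_score(findings: set[str]) -> float:
    """Sum the points for every clinical feature that is present."""
    unknown = findings - WELLS_POINTS.keys()
    if unknown:
        raise ValueError(f"unrecognized findings: {sorted(unknown)}")
    return sum(WELLS_POINTS[name] for name in findings)


def wells_category(score: float) -> str:
    """Interpretation thresholds from Table 4-2."""
    if score > 6.0:
        return "High"
    if score >= 2.0:
        return "Intermediate"
    return "Low"


patient = {"heart_rate_over_100", "hemoptysis"}
score = wells_score(patient)
print(f"score = {score}, clinical probability = {wells_category(score)}")
```

For the illustrative patient with tachycardia and hemoptysis, the score is 2.5, which falls in the intermediate category.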
FORMAL DECISION SUPPORT TOOLS
■
■DECISION SUPPORT SYSTEMS AND ARTIFICIAL 
INTELLIGENCE
Over the past 50 years, many attempts have been made to develop 
computer systems to aid clinical decision-making and patient man­
agement. Conceptually, computers offer several levels of potentially 
useful support for clinicians. At the most basic level, they provide 
ready access to vast reservoirs of information, which may, however, be 
quite difficult to sort through to find what is needed. At higher levels, 
computers can support care management decisions by making accu­
rate predictions of outcome, or can simulate the whole decision pro­
cess, and provide algorithmic guidance. Computer-based predictions 
using Bayesian or statistical regression models inform a clinical deci­
sion but do not actually reach a “conclusion” or “recommendation.” 
Recent advances in artificial intelligence (AI) suggest that medicine is 
on the threshold of developing much more powerful digital tools, but 
current enthusiasm for such tools still exceeds demonstrated utility in 
clinical care. Work on AI dates back to the 1950s and can be separated 
into three major subtypes: neural networks, machine learning (and its 
subtype deep learning), and generative AI. Machine learning methods 
are being applied to pattern recognition tasks such as the examination 
of skin lesions and the interpretation of x-rays. Generative AI (AI 
models that generate new content) is a term that covers several differ­
ent types of systems, including large language models (which gener­
ate language-based content; an example is GPT-4). Large language 
models offer some promise in helping to create clinical notes. Their 
use in support of clinical decision-making, however, is still at a very 
preliminary stage with need for independent validation in a popula­
tion separate from the one used to develop the model. Early evidence 
suggests that clinicians are willing to rely on AI-based tools even 
when the information provided is clearly inaccurate or contradictory. 
Concerns about model confabulation and the potential for patient 
harms mandate careful and comprehensive testing before AI tools are 
implemented in clinical care.
Reminder or protocol-directed systems do not make predictions 
but use existing algorithms, such as guidelines or appropriate utiliza­
tion criteria, to direct clinical practice. In general, however, decision 
support systems have so far had little impact on practice. Reminder 
systems built into EHRs have shown the most promise, particularly in 
correcting drug dosing and promoting adherence to guidelines. Check­
lists may also help avoid or reduce errors.
■
■DECISION ANALYSIS
Compared with the decision support methods discussed earlier, 
decision analysis represents a normative prescriptive approach to 
decision-making in the face of uncertainty. Its principal application

is in complex decisions. For example, public health policy decisions 
often involve trade-offs in length versus quality of life, benefits versus 
resource use, population versus individual health, and uncertainty 
regarding efficacy, effectiveness, and adverse events as well as values or 
preferences regarding mortality and morbidity outcomes.
One recent analysis using this approach involved the optimal 
screening strategy for breast cancer, which has remained controversial, 
in part because a randomized controlled trial to determine when to 
begin screening and how often to repeat screening mammography is 
impractical. In 2016, the National Cancer Institute–sponsored Cancer 
Intervention and Surveillance Network (CISNET) examined eight 
strategies differing by whether to initiate mammography screening at 
age 40, 45, or 50 years and whether to screen annually, biennially, or 
annually for women in their forties and biennially thereafter (hybrid). 
The six simulation models found biennial strategies to be the most 
efficient for average-risk women. Biennial screening for 1000 women 
from age 50–74 years versus no screening avoided seven breast cancer 
deaths. Screening annually from age 40–74 years avoided three addi­
tional deaths but required 20,000 additional mammograms and yielded 
1988 more false-positive results. The results varied with patient risk: 
in patients with a two- to fourfold higher risk for developing 
breast cancer, annual screening from age 40–74 years yielded 
benefits similar to those of biennial screening from age 50–74 years. For average-risk 
patients with moderate or severe comorbidities, screening could be 
stopped earlier, at age 66–68 years.
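The trade-off between the two screening strategies can be expressed as incremental ratios using only the figures quoted above (per 1000 women screened). This is simple arithmetic on the reported numbers, not a reproduction of the simulation models.

```python
# Incremental results of annual screening from age 40-74 years compared with
# biennial screening from age 50-74 years, as quoted in the text.
additional_deaths_averted = 3
additional_mammograms = 20_000
additional_false_positives = 1_988

mammograms_per_death_averted = additional_mammograms / additional_deaths_averted
false_positives_per_death_averted = additional_false_positives / additional_deaths_averted

print(f"additional mammograms per additional death averted: {mammograms_per_death_averted:,.0f}")
print(f"additional false positives per additional death averted: {false_positives_per_death_averted:,.0f}")
```

On these figures, each additional breast cancer death averted by the more intensive strategy requires roughly 6700 additional mammograms and generates roughly 660 additional false-positive results.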
This analysis involved six models that reproduced epidemiologic 
trends and a screening trial result, accounted for digital technology and 
treatment advances, and considered quality of life, risk factors, breast 
density, and comorbidity. It provided novel insights into a public health 
problem in the absence of randomized clinical trials examining alter­
native start age, stop age, and screening frequencies, and helped weigh 
the pros and cons of such a health policy recommendation. Although 
such models have been developed for selected clinical problems, their 
benefit and application to individual real-time clinical management 
has yet to be demonstrated.
DIAGNOSIS AS AN ELEMENT OF QUALITY OF CARE
High-quality medical care begins with accurate diagnosis. The inci­
dence of diagnostic errors has been estimated by a variety of methods 
including postmortem examinations, medical record reviews, and 
medical malpractice claims, with each yielding complementary but 
different estimates of this quality-of-care and patient-safety problem. In the 
past, diagnostic errors tended to be viewed as a failure of individual 
clinicians. The modern view is that they mostly reflect deficiencies in the 
system of care. Current estimates suggest that nearly everyone will experience 
at least one diagnostic error in their lifetime, leading to mortality, 
morbidity, unnecessary tests and procedures, costs, and anxiety.
Solutions to the “diagnostic errors as a system of care” problem 
have focused on system-level approaches, such as decision support 
and other tools integrated into EHRs. The use of checklists has been 
proposed as a means of reducing some of the cognitive errors discussed 
earlier in the chapter, such as premature closure. While checklists have 
been shown to be useful in certain medical contexts, such as operating 
rooms and intensive care units, their value in preventing diagnostic 
errors that lead to patient adverse events remains to be shown.
EVIDENCE-BASED MEDICINE
Clinical medicine is defined traditionally as a practice combining 
medical knowledge (including scientific evidence), intuition, and judg­
ment in the care of patients (Chap. 1). EBM updates this construct by 
placing much greater emphasis on the processes by which clinicians 
gain knowledge of the most up-to-date and relevant clinical research 
to determine for themselves whether medical interventions alter the 
disease course and improve the length or quality of life. The phrase 
“evidence-based medicine” is now used so often and in so many differ­
ent contexts that many practitioners are unaware of its original mean­
ing. The intention of the EBM program, as described in the early 1990s 

by its founding proponents at McMaster University, becomes clearer 
through an examination of its four key steps:

1.	 Formulating the management question to be answered
2.	 Searching the literature and online databases for applicable research 
data
3.	 Appraising the evidence gathered with regard to its validity and 
relevance
4.	 Integrating this appraisal with knowledge about the unique aspects 
of the patient (including the patient’s preferences about the possible 
outcomes)
The process of searching the world’s research literature and apprais­
ing the quality and relevance of studies can be time-consuming and 
requires skills and training that most clinicians do not possess. In a 
busy clinical practice, the work required is also logistically not fea­
sible. This has led to a focus on finding recent systematic overviews of 
the problem in question as a useful shortcut in the EBM process. Sys­
tematic reviews are regarded by some as the highest level of evidence 
in the EBM hierarchy because they are intended to comprehensively 
summarize the available evidence on a particular topic. To avoid the 
potential biases found in narrative review articles, predefined repro­
ducible explicit search strategies and inclusion and exclusion criteria 
seek to find all of the relevant scientific research and grade its quality. 
The prototype for this kind of resource is the Cochrane Database of 
Systematic Reviews. When appropriate, a meta-analysis is used to 
quantitatively summarize the systematic review findings (discussed 
further below).
Unfortunately, systematic reviews are not uniformly the acme of 
the EBM process they were initially envisioned to be. In select cir­
cumstances, they can provide a much clearer picture of the state of 
the evidence than is available from any individual clinical report, but 
their value is less clear when only a few trials are available, when trials 
and observational studies are mixed, or when the evidence base is only 
observational. They cannot compensate for deficiencies in the under­
lying research available, and many are created without the requisite 
clinical insights. The medical literature is now flooded with systematic 
reviews of varying quality and clinical utility. The peer review system 
has, unfortunately, not proved to be an effective arbiter of quality of 
these papers. Therefore, systematic reviews should be used with cir­
cumspection in conjunction with selective reading of some of the best 
empirical studies.
■
■SOURCES OF EVIDENCE: CLINICAL TRIALS AND REGISTRIES
The notion of learning from observation of patients is as old as medi­
cine itself. Over the past 50 years, physicians’ understanding of how 
best to turn raw observation into useful evidence has evolved consider­
ably. Medicine has received a hard refresher lesson in this process from 
the COVID-19 pandemic. Starting in the spring of 2020, case reports, 
personal and institutional anecdotal experience, and small single-center 
case series started appearing in the peer-reviewed literature and 
within months turned into a flood of confusing and often contradic­
tory evidence. Observational reports of treatments for COVID-19 
fueled the confusion. Despite >40,000 publications appearing in the 
first 7 months of the pandemic, an enormous amount of uncertainty 
around prevention, diagnosis, treatment, and prognosis of the dis­
ease remained. Many of the early 2020 publications were either small 
observational series or reviews of published series, neither of which 
can resolve the key uncertainties clinicians need to address in caring 
for these patients. These small observational studies often have sub­
stantial limitations in validity and generalizability, and although they 
may generate important hypotheses or be the first reports of adverse 
events or therapeutic benefit, they have no role in formulating modern 
standards of practice. The major tools used to develop reliable evidence 
consist of randomized clinical trials supplemented strategically by large 
(high-quality) observational registries. A registry or database typically 
is focused on a disease or syndrome (e.g., different types of cancer, 
acute or chronic CAD, pacemaker capture, or chronic heart failure), 
a clinical procedure (e.g., bone marrow transplantation, coronary 
revascularization), or an administrative process (e.g., claims data used 
for billing and reimbursement).

By definition, in observational data, the investigator does not con­
trol patient care. Carefully collected prospective observational data, 
however, can at times achieve a level of evidence quality approaching 
that of major clinical trial data through trial emulation (specifying 
eligibility criteria, interventions, outcome, follow-up, causal contrast, 
and statistical analysis) using causal inference methods. At the other 
end of the spectrum, data collected retrospectively (e.g., chart review) 
are limited in form and content to what previous observers recorded 
and may not include the specific research data being sought (e.g., 
claims data). Advantages of observational data include the inclusion 
of a broader population as encountered in practice than is typically 
represented in clinical trials because of their restrictive inclusion and 
exclusion criteria. In addition, observational data provide primary 
evidence for research questions when a randomized trial cannot be 
performed. For example, it would be difficult to randomize patients to 
test diagnostic or therapeutic strategies that are unproven but widely 
accepted in practice, and it would be unethical to randomize based on 
sex, racial/ethnic group, socioeconomic status, or country of residence 
or to randomize patients to a potentially harmful intervention, such as 
smoking or deliberately overeating to develop obesity.
A well-done prospective observational study of a particular man­
agement strategy differs from a well-done randomized clinical trial 
most importantly by its lack of protection from treatment selection 
bias. The use of observational data to compare diagnostic or therapeutic 
strategies assumes that sufficient uncertainty and heterogeneity 
exist in clinical practice to ensure that similar patients will 
be managed differently by diverse physicians. In short, the analysis 
assumes that a sufficient element of randomness (in the sense of 
disorder rather than in the formal statistical sense) exists in clini­
cal management. In such cases, statistical models attempt to adjust 
for important imbalances to “level the playing field” so that a fair 
comparison among treatment options can be made. When manage­
ment is clearly not random (e.g., all eligible left main CAD patients 
are referred for coronary bypass surgery), the problem may be too 
confounded (biased) for statistical correction, and observational data 
may not provide reliable evidence.
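
The effect of treatment selection bias, and of a simple adjustment for it, can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the counts are hypothetical (not drawn from any study), the single confounder is a two-level severity stratum, and stratification with direct standardization stands in for the unspecified "statistical models" mentioned above. The function names are likewise invented for this example. Such an adjustment can address only confounders that were measured.

```python
# Hypothetical counts (deaths, patients) per arm within each severity stratum.
# Sicker patients are preferentially treated, which mimics treatment selection bias.
STRATA = {
    "low risk": {"treated": (4, 100), "untreated": (24, 400)},
    "high risk": {"treated": (80, 400), "untreated": (30, 100)},
}


def risk(deaths, patients):
    """Observed event risk in one group."""
    return deaths / patients


def crude_risk_difference(strata):
    """Treated-minus-untreated risk, ignoring severity altogether."""
    totals = {"treated": [0, 0], "untreated": [0, 0]}
    for arms in strata.values():
        for arm, (deaths, patients) in arms.items():
            totals[arm][0] += deaths
            totals[arm][1] += patients
    return risk(*totals["treated"]) - risk(*totals["untreated"])


def standardized_risk_difference(strata):
    """Average the within-stratum risk differences, weighted by stratum size."""
    total_patients = sum(n for arms in strata.values() for _, n in arms.values())
    adjusted = 0.0
    for name, arms in strata.items():
        (d_t, n_t), (d_u, n_u) = arms["treated"], arms["untreated"]
        if n_t == 0 or n_u == 0:
            # Management was not "random" here: one arm is empty, so no fair
            # within-stratum comparison is possible.
            raise ValueError(f"stratum {name!r} has no patients in one arm")
        weight = (n_t + n_u) / total_patients
        adjusted += weight * (risk(d_t, n_t) - risk(d_u, n_u))
    return adjusted


if __name__ == "__main__":
    print(f"crude risk difference:        {crude_risk_difference(STRATA):+.3f}")
    print(f"standardized risk difference: {standardized_risk_difference(STRATA):+.3f}")
```

With these invented counts, the crude comparison makes the treatment look harmful, whereas the stratified comparison shows lower mortality with treatment in both strata. The `ValueError` branch corresponds to the situation described above in which every eligible patient receives the same management, leaving nothing to compare.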
In general, the use of concurrent controls is vastly preferable to that 
of historical controls. For example, comparison of current surgical 
management of left main CAD with medically treated patients with left 
main CAD during the 1970s (the last time these patients were routinely 
treated with medicine alone) would be extremely misleading because 
“medical therapy” has substantially improved in the interim.
Randomized controlled clinical trials include the careful prospective 
design features of the best observational data studies but also include 
the use of random allocation of treatment. This design provides the 
best protection against measured and unmeasured confounding due to 
treatment selection bias (a major aspect of internal validity). However, 
the randomized trial may not have good external validity (generaliz­
ability) if the process of recruitment into the trial resulted in the exclu­
sion of many potentially eligible subjects or if the nominal eligibility for 
the trial describes a very heterogeneous population.
Consumers of medical evidence need to be aware that randomized 
trials vary widely in their quality and applicability to practice. The 
process of designing such a trial often involves many compromises. 
For example, trials designed to gain U.S. Food and Drug Administra­
tion (FDA) approval for an investigational drug or device must fulfill 
regulatory requirements (such as the use of a placebo control) that may 
result in a trial population and design that differ substantially from 
what practicing clinicians would find most useful.
■
■META-ANALYSIS
The Greek prefix meta signifies something at a later or higher stage 
of development. Meta-analysis is research that combines and sum­
marizes the available evidence quantitatively. Although it is used to 
examine nonrandomized studies, meta-analysis is most useful for 
summarizing all available randomized trials examining a particular 
therapy used in a specific clinical context. Ideally, unpublished trials 
should be identified and included to avoid publication bias (i.e., missing 
“negative” trials that may not be published). Furthermore, the best 
meta-analyses obtain and analyze individual patient-level data from 
all trials rather than using only the summary data from published 
reports. Nonetheless, not all published meta-analyses yield reliable 
evidence for a particular problem, so their methodology should be 
scrutinized carefully to ensure proper study design and analysis. The 
results of a well-done meta-analysis are likely to be most persuasive if 
they include at least several large-scale, properly performed random­
ized trials. Meta-analysis can especially help detect benefits when 
individual trials are inadequately powered (e.g., the benefits of strep­
tokinase thrombolytic therapy in acute MI demonstrated by ISIS-2 in 
1988 were evident by the early 1970s through meta-analysis). However, 
in cases in which the available trials are small or poorly done, 
meta-analysis should not be viewed as a remedy for deficiencies in primary 
trial data or trial design.
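
How pooling can reveal a benefit that no single underpowered trial establishes can be sketched numerically. The example below is an illustration under explicit assumptions: the five trials and their counts are hypothetical, the helper names are invented, and fixed-effect inverse-variance pooling of log odds ratios is used as one common approach rather than a method prescribed here. It ignores between-trial heterogeneity and does nothing to correct for poor trial quality or publication bias.

```python
import math

# Hypothetical trials: (events_treated, n_treated, events_control, n_control).
TRIALS = [
    (10, 100, 16, 100),
    (14, 150, 21, 150),
    (8, 80, 13, 80),
    (18, 200, 28, 200),
    (11, 120, 19, 120),
]


def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its approximate variance from one 2x2 table."""
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool_fixed_effect(trials):
    """Inverse-variance weighted mean of the log odds ratios and its standard error."""
    weight_sum = 0.0
    weighted_sum = 0.0
    for trial in trials:
        log_or, variance = log_odds_ratio(*trial)
        weight = 1.0 / variance  # more precise trials get more weight
        weight_sum += weight
        weighted_sum += weight * log_or
    return weighted_sum / weight_sum, math.sqrt(1.0 / weight_sum)


def ci95(log_estimate, standard_error):
    """Approximate 95% confidence interval on the odds ratio scale."""
    half_width = 1.96 * standard_error
    return math.exp(log_estimate - half_width), math.exp(log_estimate + half_width)


if __name__ == "__main__":
    for trial in TRIALS:
        log_or, variance = log_odds_ratio(*trial)
        low, high = ci95(log_or, math.sqrt(variance))
        print(f"trial {trial}: OR {math.exp(log_or):.2f} (95% CI {low:.2f} to {high:.2f})")
    pooled, se = pool_fixed_effect(TRIALS)
    low, high = ci95(pooled, se)
    print(f"pooled: OR {math.exp(pooled):.2f} (95% CI {low:.2f} to {high:.2f})")
```

In this invented data set, each trial's confidence interval includes an odds ratio of 1, whereas the pooled interval does not. The pooled estimate is only as trustworthy as the trials that contribute to it.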
Meta-analyses typically focus on summary measures of relative 
treatment benefit, such as odds ratios or relative risks. Clinicians 
should also examine what absolute risk reduction (ARR) can be 
expected from the therapy. A metric of absolute treatment benefit that 
is frequently reported is the number needed to treat (NNT) to prevent 
one adverse outcome event (e.g., death, stroke). NNT should not be 
interpreted literally as a causal statement. NNT is simply 1/ARR. For 
example, if a hypothetical therapy reduced mortality rates over a 5-year 
follow-up by 33% (the relative treatment benefit) from 12% (control 
arm) to 8% (treatment arm), the ARR would be 12% – 8% = 4% and 
the NNT would be 1/0.04, or 25. This does not mean literally that 1 
patient benefits and 24 do not. However, it can be conceptualized as an 
informal measure of treatment efficiency. If the hypothetical treatment 
was applied to a lower-risk population, say, with a 6% 5-year mortality, 
the 33% relative treatment benefit would reduce absolute mortality 
by 2% (from 6 to 4%), and the NNT for the same therapy in this lower-risk 
group of patients would be 50. Although often not made explicit, 
comparisons of NNT estimates from different studies should account 
for the duration of follow-up used to create each estimate. In addition, 
the NNT concept assumes a homogeneity in response to treatment that 
may not be accurate. The NNT is simply another way of summarizing 
the absolute treatment difference and does not provide any unique 
information.
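
The arithmetic of the preceding example can be reproduced in a few lines. This is a minimal sketch: the function names are invented for illustration, the risks are those of the hypothetical therapy described above, and the relative benefit is assumed to be constant across baseline risks, which, as noted, may not hold in practice.

```python
def absolute_risk_reduction(control_risk: float, treated_risk: float) -> float:
    """ARR = control-arm event risk minus treatment-arm event risk."""
    return control_risk - treated_risk


def number_needed_to_treat(control_risk: float, treated_risk: float) -> float:
    """NNT = 1 / ARR; undefined when there is no absolute benefit."""
    arr = absolute_risk_reduction(control_risk, treated_risk)
    if arr <= 0:
        raise ValueError("NNT is undefined when ARR is zero or negative")
    return 1.0 / arr


def treated_risk_from_relative_reduction(control_risk: float, rrr: float) -> float:
    """Apply a constant relative risk reduction (RRR) to a baseline risk."""
    return control_risk * (1.0 - rrr)


if __name__ == "__main__":
    # Relative benefit from the example in the text: 12% -> 8%, about 33%.
    rrr = (0.12 - 0.08) / 0.12
    for baseline in (0.12, 0.06):
        treated = treated_risk_from_relative_reduction(baseline, rrr)
        arr = absolute_risk_reduction(baseline, treated)
        nnt = number_needed_to_treat(baseline, treated)
        print(f"baseline {baseline:.0%}: ARR {arr:.1%}, NNT about {nnt:.0f}")
```

Run with the two baseline risks from the text (12% and 6%), the same relative benefit gives NNTs of about 25 and 50, showing how strongly the NNT depends on the baseline risk of the population treated.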
■
■CLINICAL PRACTICE GUIDELINES
Per the 1990 Institute of Medicine definition, clinical practice guide­
lines are “systematically developed statements to assist practitioner 
and patient decisions about appropriate health care for specific clinical 
circumstances.” This definition emphasizes several crucial features of 
modern guideline development. First, guidelines are created by using 
the tools of EBM. In particular, the core of the development process is 
a systematic literature search followed by a review of the relevant peer-reviewed 
literature. Second, guidelines usually are focused on a 
clinical disorder (e.g., diabetes mellitus, stable angina pectoris) or a 
health care intervention (e.g., cancer screening). Third, the primary 
objective of guidelines is to improve the quality of medical care by 
identifying care practices that should be routinely implemented, 
based on high-quality evidence and high benefit-to-harm ratios 
for the interventions. Guidelines are intended to “assist” decision-making, 
not to define explicitly what decisions should be made in a 
particular situation, in part because guideline-level evidence alone is 
never sufficient for clinical decision-making (e.g., deciding whether 
to intubate and administer antibiotics for pneumonia in a terminally 
ill individual, in an individual with dementia, or in an otherwise 
healthy 30-year-old mother).
Guidelines are narrative documents constructed by expert panels 
whose composition often is determined by interested professional 
organizations. These panels vary in expertise and in the degree to 
which they represent all relevant stakeholders. The guideline docu­
ments consist of a series of specific management recommendations, 
a summary indication of the quantity and quality of evidence sup­
porting each recommendation, an assessment of the benefit-to-harm 
ratio for the recommendation, and a narrative discussion of the