2.5 Bioinformatics
ESSENTIALS
Bioinformatics may be defined as ‘conceptualizing biology in terms of molecules and applying “informatics techniques” (e.g. applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale’. Clinical bioinformatics may be defined as ‘the clinical application of bioinformatics-associated sciences and technologies to understand molecular mechanisms and potential therapies for human diseases’. To achieve these aims: (1) data must be curated to facilitate standardized access to existing information and to allow the submission of new entries to data sets; (2) analysis tools should be developed drawing upon both computational and biological/clinical expertise; (3) all analyses must be interpreted in a biologically/clinically meaningful manner.
Introduction
If clinical bioinformatics is to deliver the integration of molecular and clinical data and thereby translate research knowledge into effective ‘personalized’ medicine, then two broad constituencies need to be supported. Clinicians at the point of care need to understand and integrate, perhaps via decision support mechanisms, entities such as genotype/phenotype correlations, biomarker discovery, and pharmacogenomics; while researchers require accurate, structured and (ideally) coded clinical data, as well as biological reference data sets.
Ever accelerating technological advances and precipitous falls in the cost of high-throughput technologies (e.g. whole genome sequencing, expression profiling, high resolution image processing and others) mean that there is a veritable deluge of available data. Accompanying falls in the cost of the substantial computational power and associated data storage needed also mean that there is opportunity for meaningful analysis.
Ever faster turnaround times (currently measured in hours) mean that it is now feasible to introduce next-generation sequencing (NGS) into workflows directly contributing to patient care.
NGS technologies allow the identification of single nucleotide polymorphisms, point mutations, and insertions or deletions (indels) as well as larger structural changes such as translocations, rearrangements, inversions, duplications, and copy number variations. When investigating somatic mutations (e.g. in tumours), comparison with germline samples facilitates variant detection. In rare diseases, comparison within a trio (the proband and both parents) serves a similar purpose. Defined metrics that inform strict, criteria-based quality assurance are crucial if a clinical bioinformatics pipeline is to be set up.
NGS has additional capabilities to investigate cellular properties over and above the determination of genomic sequence alone. Epigenomics deals with the chemical modifications of nucleic acids (e.g. methylation of cytosine at the 5 position, and the consequent effect on gene expression). NGS offers the potential to identify changes across the entire genome by capturing epigenetic information from multiple genes simultaneously. Given that for some tumours epigenetic status reflects the overall prognosis, such analyses may provide substantially enhanced prognostic information.
However, simply aggregating patient-specific clinical data with genetic, expression, or other data will not automatically lead to better clinical outcomes. Clinical data are often unstructured and incomplete, being spread across multiple paper and electronic systems, hence clinically meaningful semantic vocabulary standards are needed (see below). At a cellular level it is clear that biology does not depend on the genome sequence alone, and in 2012 it was estimated that the biological function of approximately half of all human genes remained unknown.
Projects such as the Encyclopedia of DNA Elements (ENCODE), currently building a comprehensive list of the functional elements in the human genome, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), which supports machine-executed models of system-level biological pathways, are important. Without these, the ability to interpret analyses and to draw relevant, patient-specific biomedical conclusions will be severely impeded.
Chapter author: Afzal Chaudhry
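As a rough illustration of the tumour/germline and trio comparisons mentioned in the Introduction, the sketch below treats each sample's variant calls as a set of (chromosome, position, reference allele, alternate allele) tuples and uses set difference to find candidates. The class name, function names, tuple layout, and example variants are all hypothetical; a real pipeline would start from variant call files and apply quality metrics before any comparison.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Minimal variant key (hypothetical layout, not a file-format parser)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def candidate_somatic(tumour: Set[Variant], germline: Set[Variant]) -> Set[Variant]:
    """Variants called in the tumour but absent from the matched germline sample."""
    return tumour - germline


def candidate_de_novo(proband: Set[Variant], mother: Set[Variant],
                      father: Set[Variant]) -> Set[Variant]:
    """Variants called in the proband but in neither parent."""
    return proband - (mother | father)


# Made-up calls for illustration only; these are not real variants.
shared = Variant("1", 1000, "A", "G")
tumour_only = Variant("7", 5000, "T", "A")
proband_only = Variant("X", 250, "C", "T")

somatic = candidate_somatic({shared, tumour_only}, {shared})
de_novo = candidate_de_novo({shared, proband_only}, {shared}, set())
```

Exact set difference is the simplest possible comparison; it ignores sequencing depth, allele fraction, and calling errors, all of which matter in practice.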
SECTION 2 Background to medicine
Components needed for clinical bioinformatics
Data storage
Flexible, extensible data warehousing is essential to accommodate the volume of clinical and biomedical information. The ability to support multiple data sources containing heterogeneous data sets is crucial, and data structures must also be able to accommodate sparse data sets, as it is unlikely that any individual clinical record will contain information on all possible concepts. Typical warehouse designs are built upon a ‘dimensional fact’ model. Here, a fact is a concept relevant to decision-making (e.g. an observation made at a specific point in time such as a blood pressure measurement), while a dimension describes some attribute of that fact (e.g. the blood pressure was measured with the patient supine).
Use of a common set of semantic terms to support data aggregation/interoperability
Control of the metadata dictionary describing all of the facts in the warehouse is essential. For example, for laboratory tests, normal ranges and assay types may change over time and/or may vary from one laboratory to another. It is then impossible to aggregate data meaningfully over time or from multiple laboratories unless the results (facts) are interpreted using the relevant metadata (dimensions). Examples of such metadata include the structured vocabularies/ontologies listed in Table 2.5.1. Hierarchical ontological terminologies reflecting clinical meaning, such as SNOMED CT, are preferred over more epidemiologically orientated classification systems such as ICD-10. Dimensions should ideally be described using elements from a defined archetype represented in a definition set such as the openEHR reference model.
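The ‘dimensional fact’ model and the role of the metadata dictionary can be sketched in a few lines. The class name, field names, laboratory identifiers, and reference ranges below are illustrative assumptions only (they are not clinical reference values and are not taken from any warehouse product or from the openEHR reference model). The point is simply that a result (fact) is interpreted against the range recorded for its own laboratory (dimension) before results are pooled, and that missing dimensions are tolerated.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Fact:
    """One observation; dimensions hold whichever attributes are known (sparse)."""
    concept: str
    value: float
    dimensions: Dict[str, str] = field(default_factory=dict)


# Hypothetical metadata dictionary: (concept, laboratory) -> (low, high) range.
# The numbers are placeholders, not real reference ranges.
REFERENCE_RANGES: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("serum_potassium", "lab_A"): (3.5, 5.3),
    ("serum_potassium", "lab_B"): (3.6, 5.0),
}


def interpret(fact: Fact) -> Optional[str]:
    """Classify a result against its own laboratory's range; None if metadata is missing."""
    lab = fact.dimensions.get("laboratory")
    if lab is None:
        return None
    limits = REFERENCE_RANGES.get((fact.concept, lab))
    if limits is None:
        return None
    low, high = limits
    if fact.value < low:
        return "low"
    if fact.value > high:
        return "high"
    return "normal"


results = [
    Fact("serum_potassium", 5.2, {"laboratory": "lab_A"}),
    Fact("serum_potassium", 5.2, {"laboratory": "lab_B"}),
    Fact("serum_potassium", 5.2),  # laboratory not recorded
]
labels = [interpret(f) for f in results]
```

The same numeric value is classed differently depending on the laboratory dimension, and cannot be classed at all when that dimension is absent, which is why pooling raw values without metadata is unsafe.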
When using natural language processing technology to extract structured standardized data from unstructured information, the extraction should be ‘supervised’ by the metadata dictionaries to allow data from text-based records to be amalgamated with that from a coded record.
Meaningful analytical tools
There are multiple data sets and tools to support the analysis of bioinformatics data (Tables 2.5.2 and 2.5.3). Typically, these focus on assessing similarities between molecular sequences based on alignment. Sequence-based data sets significantly outnumber structure-based data sets because of the relative ease with which sequence data can be obtained. Additional analyses, often using specialized software, are needed over and above simple alignment analyses to detect clinically relevant structural genomic alterations.
Protein-orientated databases are often categorized as either primary, detailing the linear amino-acid sequence, or secondary, containing derived information. Secondary-based analyses may consider motifs or electrostatic interactions that are contiguous in three-dimensional space but not in the linear sequence. Some macromolecular three-dimensional structure databases contain a hierarchical taxonomy to help identify evolutionary relationships.
Ultimately the most value is seen by combining data from multiple sources: clinical, sequence, structure, expression, and function (or as many of these as exist). This may not always be straightforward due to variations in nomenclature and data formats, although web-based gateways supporting the traversal of multiple databases are becoming more effective.
Examples of clinical/research areas benefitting from clinical bioinformatics strategies
See Table 2.5.4.
Oncology research
Oncology research has tended to focus on single gene and single pathway analysis. However, NGS offers both multiple simultaneous analyses and extremely high sequence coverage, thus substantially increasing sensitivity.
International consortia such as The Cancer Genome Atlas are sequencing thousands of cancers to generate data sets across different cancer subtypes. Computational theories including pathway network analysis and graph theory can be used to model tumour-related regulatory networks and interactions, allowing complex interactions to be understood. The predictive power of multigene biomarker panels, now potentially scaled into the many thousands of genes analysed simultaneously as opposed to just tens of genes, can be profoundly enhanced (e.g. in one study a panel of 2300 genes could discriminate adenocarcinoma of the lung from healthy tissue with 100% accuracy).

Table 2.5.1 Clinical classifications/terminologies/structured vocabularies available in the United Kingdom

| Name (acronym) | Full name | Clinical related entities described | URL (accessed November 2018) |
|---|---|---|---|
| ICD-10 | International Classification of Diseases, 10th Revision | Diagnoses | https://www.who.int/classifications/icd/icdonlineversions/en/ |
| SNOMED CT | Systematized Nomenclature of Medicine Clinical Terms | Symptoms, signs, diagnoses | http://www.ihtsdo.org/snomed-ct |
| dm+d | NHS Dictionary of Medicines and Devices | Medication | http://dmd.medicines.org.uk/ |
| NLMC | National Laboratory Medicine Catalogue | Laboratory investigations | https://nlmc.x-labsystems.co.uk/ |
| LOINC | Logical Observation Identifiers Names and Codes | Laboratory investigations | https://loinc.org/ |
| NICIP | National Interim Clinical Imaging Procedure code set | Radiological investigations | https://digital.nhs.uk/services/terminology-and-classifications/national-interim-clinical-imaging-procedure-nicip-code-set |
| OPCS 4.7 | Office of Population Censuses and Surveys Classification of Interventions and Procedures version 4.7 | Procedures | https://isd.digital.nhs.uk/trud3/user/guest/group/0/pack/10 |
| HPO | Human Phenotype Ontology | Phenotypic abnormalities encountered in human disease | https://hpo.jax.org/app/ |
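The graph-theoretic view of regulatory networks mentioned under Oncology research can be illustrated with a toy example: genes are nodes, interactions are edges, and a simple measure such as degree highlights highly connected nodes. The gene labels and edges below are invented placeholders, and degree is only one of many possible network measures; this is a sketch under those assumptions, not the method of any particular study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_network(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (node, node) interaction pairs."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def hubs(network: Dict[str, Set[str]], top: int = 1) -> List[Tuple[str, int]]:
    """Return the `top` nodes ranked by degree (ties broken alphabetically)."""
    ranked = sorted(network.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(node, len(neighbours)) for node, neighbours in ranked[:top]]


# Placeholder interactions; gene_A .. gene_D are not real gene names.
toy_edges = [
    ("gene_A", "gene_B"),
    ("gene_A", "gene_C"),
    ("gene_A", "gene_D"),
    ("gene_B", "gene_C"),
]
toy_network = build_network(toy_edges)
top_hub = hubs(toy_network)
```

In this toy network `gene_A` has the most neighbours; in a real regulatory network such hubs would be candidates for closer study, not conclusions in themselves.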
Pharmacogenomics
As the genetic/molecular basis of the metabolism and mechanism of action of drugs becomes increasingly understood, therapy can be individually tailored to some degree. Recent examples include trastuzumab for HER2-positive breast cancer and imatinib for chronic myeloid leukaemia and conditions associated with tyrosine kinase-based mutations. NGS can identify somatic variants which help to direct therapy, for example as resistant tumour clones emerge: melanoma with the BRAF mutation V600E is susceptible to vemurafenib, while the p61 splice variant of BRAF V600E is not. In haematology, the potential to stratify an individual’s sensitivity to warfarin (VKORC1 and CYP2C9 gene polymorphisms) will help to guide appropriate dosing and avoid potentially life-threatening events.
Infectious diseases
The far smaller size of viral and bacterial genomes makes it practical to sequence the whole genome of an infecting pathogen. In the case of the 2009 H1N1 influenza pandemic, bioinformatics tools were able, within a few hours of the first identification of a novel mutation, to suggest a possible mechanistic explanation for the severity of the resulting phenotype. Computational analysis of genome sequence and protein structures can help in identifying likely drug susceptibility (e.g. the enterohaemorrhagic O104:H4 E. coli outbreak in Germany in 2011), while individual infecting strains can be typed and traced over both time and geographical distribution, thereby supporting more appropriate and economical public health strategies.
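Typing and tracing of infecting strains often starts from pairwise differences between aligned genome sequences. The sketch below counts differing positions (a Hamming distance, used here as a stand-in for a single nucleotide difference count) between equal-length aligned sequences and reports pairs of isolates that fall within a threshold. The sequences, isolate names, and threshold are invented for illustration; real outbreak analysis uses far longer alignments and phylogenetic methods.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def linked_pairs(isolates: Dict[str, str], threshold: int) -> List[Tuple[str, str, int]]:
    """Pairs of isolates whose distance is at or below the threshold."""
    pairs = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(isolates.items()), 2):
        distance = snp_distance(seq_a, seq_b)
        if distance <= threshold:
            pairs.append((name_a, name_b, distance))
    return pairs


# Invented sequences; not derived from any real pathogen.
toy_isolates = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAT",
    "isolate_3": "TCGAACCTAT",
}
close = linked_pairs(toy_isolates, threshold=2)
```

With a threshold of 2, only the first two isolates are linked; the choice of threshold is organism-specific and is an assumption here.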
Digital imaging
Even among the most experienced histopathologists there can be considerable interobserver variation in certain conditions. Objective algorithms to assign tumour grade and to search for other tissue-based measures of disease activity using level sets, fractal analysis, and machine learning can improve diagnosis. The adaptation of image-analysis algorithms developed for astronomy, coupled with their application to large annotated study cohorts, is likely to provide a powerful set of analytical tools. In dermatology, texture analysis, neural network framework-based analyses, data mining of skin images and computer-based reconstruction of the skin surface have all been used to support research into reliable diagnostic strategies.
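Fractal analysis is named above as one source of objective tissue measures. One common estimator is the box-counting dimension: cover a binary image with boxes of decreasing size, count the boxes that contain foreground, and take the slope of log(count) against log(1/size). The sketch below assumes a segmented binary mask is already available and uses synthetic shapes, not histology images; the function names and box sizes are illustrative choices.

```python
import math
from typing import Sequence


def box_count(mask: Sequence[Sequence[int]], size: int) -> int:
    """Count size x size boxes that contain at least one foreground pixel."""
    rows, cols = len(mask), len(mask[0])
    count = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            if any(mask[r][c]
                   for r in range(top, min(top + size, rows))
                   for c in range(left, min(left + size, cols))):
                count += 1
    return count


def box_counting_dimension(mask: Sequence[Sequence[int]], sizes: Sequence[int]) -> float:
    """Least-squares slope of log(count) against log(1/size).

    Assumes the mask has at least one foreground pixel and at least two sizes.
    """
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(mask, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic 8 x 8 shapes: a filled square and a one-pixel-wide diagonal line.
filled = [[1] * 8 for _ in range(8)]
diagonal = [[1 if r == c else 0 for c in range(8)] for r in range(8)]
dim_filled = box_counting_dimension(filled, [1, 2, 4])
dim_line = box_counting_dimension(diagonal, [1, 2, 4])
```

The filled square gives a dimension close to 2 and the line close to 1, as expected; irregular tissue boundaries would typically fall between these values.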
Table 2.5.2 Publicly available databases of biological knowledge

| Domain | Database | URL (accessed November 2018) |
|---|---|---|
| Nucleotide | DDBJ | https://www.ddbj.nig.ac.jp |
| Nucleotide | GenBank | https://www.ncbi.nlm.nih.gov/genbank/ |
| Genome | COGs | https://www.ncbi.nlm.nih.gov/COG/ |
| Genome | Entrez genome | https://www.ncbi.nlm.nih.gov/genome |
| Genome | GeneCensus | http://bioinfo.mbb.yale.edu/genecensus/ |
| Protein (primary) | NRDB | https://www.ncbi.nlm.nih.gov/protein |
| Protein (primary) | OWL | http://130.88.97.239/OWL/index.php |
| Protein (primary) | SWISS-PROT | https://www.uniprot.org/uniprot/?query=reviewed:yes |
| Protein (secondary) | Pfam | http://pfam.xfam.org/ |
| Protein (secondary) | PRINTS | http://130.88.97.239/PRINTS/index.php |
| Protein (secondary) | PROSITE | https://prosite.expasy.org/ |
| Protein (macromolecular) | CATH | http://www.cathdb.info/ |
| Protein (macromolecular) | PDBeFold | http://www.ebi.ac.uk/msd-srv/ssm/ |
| Protein (macromolecular) | Protein Data Bank | https://www.rcsb.org/ |
| Protein (macromolecular) | SCOP | http://scop.mrc-lmb.cam.ac.uk/scop/index.html |
| Functional/systems biology | ENCODE | https://genome.ucsc.edu/ENCODE/ |
| Functional/systems biology | KEGG | https://www.genome.jp/kegg/ |
| Vocabulary | Gene Ontology | http://geneontology.org/ |
| Integrated systems/web gateways | InterPro | https://www.ebi.ac.uk/interpro/ |
| Integrated systems/web gateways | UniProt | http://www.uniprot.org/ |
Table 2.5.3 Major research institutions providing access to a wide range of bioinformatics databases and analysis tools

| Institution | URL |
|---|---|
| EMBL-European Bioinformatics Institute | https://www.ebi.ac.uk/ |
| NCBI (National Center for Biotechnology Information) | http://www.ncbi.nlm.nih.gov/ |
| Wellcome Trust Sanger Institute | https://www.sanger.ac.uk/science/tools |
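Most of the resources in Tables 2.5.2 and 2.5.3 rest on alignment-based comparison of sequences. A minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach) follows. The scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and production tools use substitution matrices, more elaborate gap penalties, and heuristics for speed; this is not the implementation used by any of the listed services.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # previous[j] is the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_gap = global_alignment_score("GATTACA", "GATACA")
```

Only the score is computed here, using two rows of the table to save memory; recovering the alignment itself would require keeping the full table and tracing back through it.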
Conclusions
The need to deliver safe, timely, sustainable, and patient-centric care, along with the need for evidence-based strategies, means that any new technology first has to demonstrate clear translational research benefits before it can be adopted into routine practice. We now have the means to generate and analyse data sets that can lead to such benefits, and this capability continues to develop at a breakneck pace.
If the most is to be made of these new data sets and analyses, then not only must appropriate clinical data be available for correlation, but there will also be a need for more structured training programmes and curricula to train clinicians in the analysis, interpretation, and use of such data. The UK Health Education England Genomics Education Programme Clinical Bioinformatics group reported in early 2015 and recommended a series of steps and programmes for a range of healthcare staff to address the long-term goal of establishing a workforce fit for genomic medicine. The implementation of such recommendations is awaited.
Table 2.5.4 Examples of clinical/research areas benefitting from clinical bioinformatics strategies

| Clinical/research area | Example |
|---|---|
| DNA/RNA sequencing and expression profiling | Comprehension of biomolecular pathways underlying malignant transformation |
| Biomarker identification | Improved classification, early diagnosis, prognostication, and tailoring of therapy |
| Pharmacogenomics/proteomics | Tailoring of therapy: likelihood of therapeutic benefit as well as likely propensity to side effects |
| Pathogen genome/protein sequence/structure and function | Description of putative mechanisms underpinning phenotypic manifestations; susceptibility to antimicrobial therapy; epidemiological analysis to identify environmental factors and support the relevant control mechanisms |
| Digital imaging | Machine learning based improvements in cellular/tissue analysis and diagnosis; image analysis of three-dimensional geometric/structural anatomy using both visible (e.g. glaucoma) and nonvisible spectra (e.g. infrared analysis of meibomian gland morphology in dry eye syndromes) |