
Human Genetics as a Strategic Imperative to Accelerate Drug Discovery: The Alliance for Genomic Discovery

By | Clinical Genomics

Key Takeaways:

  • Pharmaceutical development is high-risk and resource-intensive, with a 90% failure rate in clinical trials, often due to inadequate efficacy, toxicity, drug properties, or commercial viability.
  • Incorporating human genetic evidence doubles drug approval rates, paving the way for innovative therapies and new molecular entities.
  • Techniques like GWASs and PheWAS linking genetic data to phenotypic data enhance drug development by identifying associations between rare alleles and diseases.
  • Published human genetic studies have centered primarily on individuals of European descent, limiting our understanding of genetic diversity and impeding the development of new therapies suited to diverse populations; establishing study cohorts that include under-represented populations is therefore crucial for promoting health equality and identifying novel drug targets based on diverse genetic variants.
  • The Alliance for Genomic Discovery (AGD) aims to reshape drug development by sequencing 250,000 diverse samples, providing a powerful resource for pharmaceutical members to correlate genetic variations with clinical outcomes and, in turn, enabling these companies to better serve a global population.

 

The Struggle to Discover New Therapies

Discovering and developing a pharmaceutical is a resource-intensive and high-risk endeavor, sometimes spanning 15 years and exceeding $2 billion in cost before approval (Hinkson et al., 2020). About nine out of ten potential therapies that progress to clinical trials fail before approval (Dowden & Munro, 2019; Sun et al., 2022). The four primary contributors to this 90% failure rate are inadequate clinical efficacy, unmanageable toxicity, suboptimal drug-like properties, and a lack of commercial viability (Dowden & Munro, 2019; Harrison, 2016; Sun et al., 2022). To increase the chances of a drug target passing these critical checkpoints, considerable effort can be directed toward incorporating human genetic evidence into drug development.

In the drug development pipeline, all compounds must undergo rigorous testing in animal models before entering clinical phases, providing evidence of their potential to treat disease. However, despite promising results in preclinical studies, the efficacy and safety observed in animal models often fail to translate to human clinical trials. Integrating human genetic evidence into the drug development process has recently emerged as a crucial strategy to navigate this challenge. Drugs grounded in such evidence exhibit a twofold increase in approval rates (Nelson et al., 2015), contributing to a higher prevalence of first-in-class therapies and new molecular entities (NMEs) (King et al., 2019). This not only accelerates the approval process but also streamlines the discovery of more effective and targeted treatments. Leveraging human genetic data gives researchers valuable insights into the genetic basis of diseases, facilitating the identification of better drug targets. The substantial presence of genetic evidence behind FDA-approved drugs in 2021 (Ochoa et al., 2022) underscores its instrumental role in advancing drug discovery and fostering the emergence of innovative pharmaceutical solutions.

Linking Genetics to Clinical Data for Drug Discovery

To incorporate genetics into therapeutic development, researchers can link an individual's genetic code to their electronic health records (EHRs). Techniques such as genome-wide association studies (GWASs), phenome-wide association studies (PheWASs), Mendelian randomization, and analyses of loss- or gain-of-function variants can then uncover associations between rare alleles and human disease (Krebs & Milani, 2023). Using these techniques, drugs tailored for Mendelian disorders have achieved notable success in clinical trials and approvals (Heilbron et al., 2021). For instance, the genetic disease autosomal dominant hypercholesterolemia (ADH) confers an increased risk of coronary artery disease (CAD) through elevated levels of plasma low-density lipoprotein (LDL). By linking phenotypic data with genetic data, researchers identified the association of the PCSK9 gene with high LDL levels (Abifadel et al., 2003). This kickstarted a series of studies that culminated in the approval of two monoclonal antibodies that inhibit PCSK9, Repatha (evolocumab) and Praluent (alirocumab) (Krebs & Milani, 2023; Robinson et al., 2015), with treatment reducing the rate of major adverse cardiovascular events by half (Kaddoura et al., 2020). Indeed, therapies derived from such impactful rare alleles are 6-7.2 times more likely to receive approval because of their substantial effect on symptoms (Nelson et al., 2015; King et al., 2019). However, for many prevalent diseases, heritable risk is predominantly associated with numerous common variants, each with a small individual effect size. This intricate genetic landscape complicates the identification of therapeutic targets, making the discovery of new avenues for therapy challenging and necessitating new strategies.

So far, a disproportionate number of published human genetic studies have centered on individuals of European descent (Fatumo et al., 2022). This narrow focus restricts our understanding to a limited diversity of alleles and genetic disorders, hindering the development of new therapies. To promote health equality, it is crucial to establish study cohorts that include under-represented populations; after all, individuals of European descent harbor only a fraction of total human genetic variation (Heilbron et al., 2021). Diverse cohorts offer unique opportunities to identify novel drug targets based on genetic variants that are less frequent, or even absent, in people of European ancestry. Genetic studies also have greater discovery power in populations where a disease is more prevalent and disease cohorts are therefore larger; at the same time, the resulting discoveries will be more relevant and beneficial for those populations.

Founding the Alliance for Genomic Discovery

This need to identify rare genetic variants in diverse patient cohorts drove NashBio and Illumina Inc. to collaborate in establishing AGD. AGD, comprising eight member organizations—AbbVie, Amgen, AstraZeneca, Bayer, Merck, Bristol Myers Squibb (BMS), GlaxoSmithKline Pharmaceuticals (GSK), and Novo Nordisk (Novo)—aims to expedite therapeutic development through whole-genome sequencing (WGS) of 250,000 samples from Vanderbilt University Medical Center’s (VUMC) biobank repository, BioVU®. In the first phase of AGD, deCODE genetics performed WGS on the first 35,000 VUMC samples, composed primarily of DNA from individuals of African ancestry. Moving forward, deCODE/Amgen will sequence the remaining samples, and Alliance members will have access to the resulting data for drug discovery and therapeutic development. The WGS data will then be linked with structured EHR data from NashBio and VUMC, creating a valuable resource for pharmaceutical members to correlate genetic variations with clinical outcomes. To learn more about how AGD aims to accelerate drug discovery and to hear directly from the alliance members, click here.

Summary

AGD marks a pivotal step in reshaping drug development, offering a solution to the challenges plaguing the pharmaceutical industry. Given the staggering 90% failure rate in clinical trials, AGD's incorporation of human genetic evidence into drug development aims to increase the likelihood that drug targets reach approval, fostering the discovery of more effective and targeted treatments. AGD also aims to address the limitations of existing genetic resources and studies. The WGS of 250,000 samples, encompassing diverse populations and linked with structured EHR data, provides pharmaceutical members with a powerful resource. This not only accelerates drug discovery but also facilitates the development of tailored therapies. AGD represents a significant step toward healthcare equality, highlighting the importance of diverse genetic studies in progressing drug discovery for the benefit of all people.

 

References

Abifadel, M., Varret, M., Rabès, J.-P., Allard, D., Ouguerram, K., Devillers, M., Cruaud, C., Benjannet, S., Wickham, L., Erlich, D., Derré, A., Villéger, L., Farnier, M., Beucler, I., Bruckert, E., Chambaz, J., Chanu, B., Lecerf, J.-M., Luc, G., … Boileau, C. (2003). Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nature Genetics, 34(2), 154–156. https://doi.org/10.1038/ng1161

Dowden, H., & Munro, J. (2019). Trends in clinical success rates and therapeutic focus. Nature Reviews. Drug Discovery, 18(7), 495–496. https://doi.org/10.1038/d41573-019-00074-z

Fatumo, S., Chikowore, T., Choudhury, A., Ayub, M., Martin, A. R., & Kuchenbaecker, K. (2022). A roadmap to increase diversity in genomic studies. Nature Medicine, 28(2), 243–250. https://doi.org/10.1038/s41591-021-01672-4

Harrison, R. K. (2016). Phase II and phase III failures: 2013-2015. Nature Reviews. Drug Discovery, 15(12), 817–818. https://doi.org/10.1038/nrd.2016.184

Heilbron, K., Mozaffari, S. V, Vacic, V., Yue, P., Wang, W., Shi, J., Jubb, A. M., Pitts, S. J., & Wang, X. (2021). Advancing drug discovery using the power of the human genome. The Journal of Pathology, 254(4), 418–429. https://doi.org/10.1002/path.5664

Hinkson, I. V., Madej, B., & Stahlberg, E. A. (2020). Accelerating Therapeutics for Opportunities in Medicine: A Paradigm Shift in Drug Discovery. Frontiers in Pharmacology, 11. https://doi.org/10.3389/fphar.2020.00770

Kaddoura, R., Orabi, B., & Salam, A. M. (2020). PCSK9 Monoclonal Antibodies: An Overview. Heart Views : The Official Journal of the Gulf Heart Association, 21(2), 97–103. https://doi.org/10.4103/HEARTVIEWS.HEARTVIEWS_20_20

King, E. A., Davis, J. W., & Degner, J. F. (2019). Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLOS Genetics, 15(12), e1008489. https://doi.org/10.1371/journal.pgen.1008489

Krebs, K., & Milani, L. (2023). Harnessing the Power of Electronic Health Records and Genomics for Drug Discovery. Annual Review of Pharmacology and Toxicology, 63(1), 65–76. https://doi.org/10.1146/annurev-pharmtox-051421-111324

Nelson, M. R., Tipney, H., Painter, J. L., Shen, J., Nicoletti, P., Shen, Y., Floratos, A., Sham, P. C., Li, M. J., Wang, J., Cardon, L. R., Whittaker, J. C., & Sanseau, P. (2015). The support of human genetic evidence for approved drug indications. Nature Genetics, 47(8), 856–860. https://doi.org/10.1038/ng.3314

Ochoa, D., Karim, M., Ghoussaini, M., Hulcoop, D. G., McDonagh, E. M., & Dunham, I. (2022). Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nature Reviews. Drug Discovery, 21(8), 551. https://doi.org/10.1038/d41573-022-00120-3

Robinson, J. G., Farnier, M., Krempf, M., Bergeron, J., Luc, G., Averna, M., Stroes, E. S., Langslet, G., Raal, F. J., El Shahawy, M., Koren, M. J., Lepor, N. E., Lorenzato, C., Pordy, R., Chaudhari, U., & Kastelein, J. J. P. (2015). Efficacy and Safety of Alirocumab in Reducing Lipids and Cardiovascular Events. New England Journal of Medicine, 372(16), 1489–1499. https://doi.org/10.1056/NEJMoa1501031

Sun, D., Gao, W., Hu, H., & Zhou, S. (2022). Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica. B, 12(7), 3049–3062. https://doi.org/10.1016/j.apsb.2022.02.002


De-identification: Balancing Privacy and Utility in Healthcare Data

By | Healthcare Data

Key Takeaways:

  • De-identification is the process of removing or obscuring personal health information in medical records to protect patient privacy.
  • De-identification is critical for enabling the sharing of data for secondary research purposes such as public health studies while meeting privacy regulations like HIPAA.
  • Common de-identification techniques include suppression, generalization, perturbation and synthetic data generation.
  • There is often a balance between data utility and privacy risk that must be evaluated on a case-by-case basis when de-identifying data.
  • Emerging privacy-enhancing computation methods like federated learning and differential privacy offer complementary approaches to de-identification.

What is De-identification and Why is it Important for Healthcare?

Patient health information is considered highly sensitive data in need of privacy protections. However, medical data sharing enables critically important research on public health, personalized medicine, and more. De-identification techniques that remove identifying information and decrease the risk of exposing protected health information play a crucial role in balancing these needs for privacy and innovation.

Definitions and Concepts

The HIPAA Privacy Rule defines de-identification as the process of preventing a person’s identity from being connected with health information. Once data has been de-identified per the Privacy Rule’s standards, it is no longer considered protected health information (PHI) and can be freely shared for research use cases like public health studies, therapeutic effectiveness studies and medical informatics analytics.

Perfect de-identification that carries no risk of re-identifying patients is very difficult, if not impossible, to accomplish with current technology. As a result, regulations like HIPAA allow for formal designations of “de-identified” health data based on achieving sufficient pseudonymity through the suppression or generalization of identifying data elements. HIPAA also defines a limited data set, in which certain identifiers are scrubbed rather than fully stripped, that can be shared under a data use agreement.

The re-identification risk spectrum ranges from blatant identifiers like names, home addresses and social security numbers to quasi-identifiers like birthdates and narrowed locations that would not directly name the patient but could be pieced together to deduce identity in combination, especially as external data sources grow more public over time. State-of-the-art de-identification evaluates both blatant and quasi-identifier risks to minimize traceability while maximizing analytic utility.

Motivating Use Cases

Research and public health initiatives rely on the sharing of de-identified health data to drive progress on evidence and outcomes. The Cancer Moonshot’s data sharing efforts highlight the massive potential impact of medical databases, cohorts and real-world evidence generation on accelerating cures via de-identified data aggregation and analytics. The openFDA program demonstrates governmental encouragement of privacy-respecting access to regulatory datasets to inform digital health entrepreneurs. Patient matching in these fragmented healthcare datasets would be impossible using directly identifiable data. Apple’s ResearchKit and CareKit frameworks facilitate de-identified mobile health data sharing for app developers to build new participatory research applications.

Data marketplaces and trusted third parties are emerging to certify and exchange research-ready, consented data assets like clinico-genomic data underlying scientific publications and clinical trials. Startups and health systems manage data sharing agreements and audit logs around distributed sites leveraging de-identified data. Rich metadata combined with privacy-preserving record linkage techniques that avoid direct identifiers enables specific patient subgroup analytics without compromise.

Overall research efficiency improves when more participants openly share their health data. But none of this research progress would be possible if stringent de-identification practices were not implemented to earn patient trust in data sharing.

De-Identification Techniques and Standards

There are two high-level categories of common de-identification protocols in healthcare: 1) suppressing blatant identifiers, typically following frameworks like HIPAA, and 2) actively transforming the data itself through various forms of generalization, perturbation or synthetic data production.

Suppressing Identifiers

The HIPAA Privacy Rule designates 18 categories of Protected Health Information identifiers that must be removed to achieve de-identified status, including names, geographic details narrower than state level, all dates other than years, contact information, IDs and record numbers, vehicle and device identifiers, URLs, IP addresses, biometrics, and other unique identifying elements.

Messages, images and unstructured data require specialized redaction processes to scrub both blatant and quasi-identifiers related to the patient, provider, institution or researchers involved. Named entity recognition and text annotation techniques help automate the detection of identifiable concepts. Voice and video data are more challenging media to de-identify.
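
As a toy illustration of the redaction step described above, the sketch below scrubs a few blatant identifiers from a free-text note using regular expressions. The patterns and placeholder labels are invented for this example; production pipelines rely on named entity recognition and curated rule sets rather than a handful of regexes.

```python
import re

# Toy redaction pass for unstructured notes. A real de-identification pipeline
# combines NER models with dictionaries; these patterns only illustrate the idea.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def redact(note: str) -> str:
    """Replace each pattern match with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

print(redact("Seen 03/14/2022, MRN 884213, call 615-555-0199."))
# -> "Seen [DATE], [MRN], call [PHONE]."
```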

Generalization and Aggregation

When full dates, locations, ages over 89 and other quasi-identifiers cannot be completely suppressed without losing analytic value from the structured data, generalization techniques band these details into broader categories, preserving some descriptive statistics while hiding individual values.

Aggregating masked data across many patient records also prevents isolation of individuals. Row level de-identification risks in sparse data featuring outliers and uncommon combinations of traits can be mitigated by pooling data points into broader summaries before release rather than allowing raw access.
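
A minimal sketch of these generalization and aggregation steps might look like the following, assuming a small tabular extract with hypothetical column names: ages are banded (with ages over 89 pooled), ZIP codes truncated, admission dates reduced to years, and the release reduced to group counts rather than row-level records.

```python
import pandas as pd

# Hypothetical patient extract; values and column names are invented.
patients = pd.DataFrame({
    "age": [34, 91, 67, 45],
    "zip": ["37203", "37212", "37011", "37203"],
    "admit_date": pd.to_datetime(["2021-03-02", "2020-11-19", "2021-07-08", "2021-03-30"]),
    "diagnosis": ["E11", "I25", "E11", "I25"],
})

generalized = pd.DataFrame({
    # Band ages, pooling everything above 89 into the top band.
    "age_band": pd.cut(patients["age"].clip(upper=90), bins=[0, 17, 44, 64, 90],
                       labels=["0-17", "18-44", "45-64", "65+"]),
    "zip3": patients["zip"].str[:3],          # truncate ZIP to 3 digits
    "admit_year": patients["admit_date"].dt.year,  # keep only the year
    "diagnosis": patients["diagnosis"],
})

# Aggregation: release counts per generalized group instead of individual rows.
summary = generalized.groupby(["age_band", "diagnosis"], observed=True).size()
print(summary)
```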

Perturbation

Perturbation encompasses a wide array of mathematical and statistical data alteration techniques that aim to distort the original data values and distributions while maintaining the general trends and correlations warranting analysis.

Value distortion methods include quantization to normalize numbers into ranges, squeezing and stretching value dispersion, rounding or truncating decimals, swapping similar records and discretizing continuous variables. Objects can be clustered into groups that are then analyzed in aggregate. Multiple perturbed versions of the dataset can be safely released to enable reproducible confirmation of discovered associations while avoiding leakage of the precise source data.
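
The snippet below sketches three of these perturbation moves (additive noise, rounding into bins, and swapping similar records) on a made-up vector of lab values; real protocols calibrate such transformations against the analyses they are meant to preserve.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical LDL lab values for a small cohort; purely illustrative.
ldl = np.array([96.0, 131.5, 154.2, 88.7, 172.9, 143.3])

# 1) Additive noise: distort individual values while roughly preserving the mean.
noisy = ldl + rng.normal(0, 5.0, size=ldl.shape)

# 2) Rounding / discretization: quantize values into 10 mg/dL bins.
binned = np.round(ldl / 10) * 10

# 3) Rank swapping: exchange values between records with similar ranks.
order = np.argsort(ldl)
swapped = ldl.copy()
swapped[order[:2]] = swapped[order[:2]][::-1]   # swap the two lowest values

print(noisy.round(1), binned, swapped, sep="\n")
```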

Combinations of generalization and perturbation provide flexibility for particular data types and contexts. The strengths, weaknesses and tuning of parameters merit a technical deep dive. The key is calibrating perturbation to maximize analytic integrity while minimizing correlation risk. Ongoing access rather than static publication also allows refinement of data treatment to meet evolving security assumptions and privacy regulations.

Synthetic Data

Synthetic datasets represent an emerging approach for modeling realistic artificial data distributions that resemble an actual patient group without containing the original records. Once the statistical shape of data properties is learned from the genuine dataset, simulated synthetic data can be sampled from generative models that emulate plausible features and relationships without allowing deduction of the real samples.

The underlying models must sufficiently capture multidimensional interactions and representation of minority groups within the patient population. Features such as ethnicity, outcomes, treatments and behaviors must be appropriately represented instead of using simplistic or biased summary statistics that ignore important correlations. Synthetic data techniques applying machine learning and differential privacy mechanisms to reconstruct distributions show significant promise for shared data sandbox environments. Cloud vendors like AWS, Google Cloud and Microsoft Azure now provide synthetic data services.
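
As a highly simplified sketch of the idea, the example below learns the mean and covariance of a few numeric features from a (simulated) real cohort and samples artificial records from a multivariate normal with the same moments. Production synthetic-data services model far richer structure, including mixed data types, rare subgroups and nonlinear relationships, often with differential privacy added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real cohort: three numeric features, values invented.
real = np.column_stack([
    rng.normal(62, 12, 500),    # age
    rng.normal(128, 18, 500),   # systolic blood pressure
    rng.normal(5.9, 0.8, 500),  # HbA1c
])

# Learn the statistical "shape" (first two moments) of the genuine data.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records that mimic those moments without copying any row.
synthetic = rng.multivariate_normal(mu, cov, size=500)
print("real means:     ", np.round(mu, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```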

Evaluating the Risk-Utility Tradeoff

Ideally, de-identified health data removes enough identifying risk to prevent adversaries from recognizing individuals while retaining enough fidelity to offer scientific utility for the intended analyses by qualified researchers. But optimizing both privacy protection and analytic value requires navigating technical and ethical nuances around plausible re-identification vulnerabilities and scenarios balanced against access restrictions on derivative insights in the public interest.

Quantitative statistical metrics like k-anonymity attempt to mathematically define anonymity sets with at least k records sharing a combination of quasi-identifiers, so that no individual can be isolated. L-diversity metrics further require these groups to contain diverse sensitive values, limiting the confidence of guessing the matching identity. T-closeness measures how far the distribution of sensitive attributes within a group diverges from the overall distribution. Quantifying information loss helps data curators shape treatment processes and the inclusion of synthetic records. Interpreting these model-based metrics requires understanding their assumptions and limitations with respect to adversary means and background knowledge.
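
For intuition, the sketch below computes two of these metrics on a tiny, invented release table: k is the size of the smallest group sharing a combination of quasi-identifiers, and l is the smallest number of distinct sensitive values within any such group.

```python
import pandas as pd

# Invented, already-generalized release table; column names are hypothetical.
released = pd.DataFrame({
    "age_band": ["18-44", "18-44", "45-64", "45-64", "45-64", "65+"],
    "zip3": ["372", "372", "372", "372", "370", "370"],
    "sex": ["F", "F", "M", "M", "M", "F"],
    "diagnosis": ["E11", "E11", "I25", "E11", "I25", "I25"],
})

quasi_identifiers = ["age_band", "zip3", "sex"]

# k-anonymity: smallest group size over all quasi-identifier combinations.
k = released.groupby(quasi_identifiers).size().min()
print(f"k-anonymity of this release: k = {k}")

# l-diversity: smallest count of distinct sensitive values (diagnosis) per group.
l = released.groupby(quasi_identifiers)["diagnosis"].nunique().min()
print(f"l-diversity on diagnosis: l = {l}")
```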

More meaningful measures account for qualitative harms of identity traceability for affected groups based on socioeconomic status, minority populations, immigration factors, substance history, abuse status, disability needs and other cultural contexts that influence vulnerability irrespective of mathematical protections. Trusted access policies should offer options for verifiable rationale when requesting clearer data from data stewards who can evaluate situational sensitivity factors.

Overall responsibility falls upon custodial institutions and data safe havens to conduct contextual integrity assessments ensuring fair data flows to legitimate purposes. This means formally evaluating both welfare impacts on individuals and excluded populations, as well as potential data misuses or manipulative harms at population scale, such as discriminatory profiling. Updated governance mechanisms must address modern re-identification realities and connective threats.

Future Directions

Traditional de-identification practices struggle with handling high-dimensional, heterogeneous patient profiles across accumulating data types, modalities, apps, sensors, workflows and research studies. While valuable on their own, these techniques may fail to fully protect individuals as the ubiquity of digital traces multiplies potential quasi-identifiers. Absolute anonymity also severely limits permissible models and computations.

Emerging areas like federated analytics and differential privacy relax the goal of total de-identification by keeping raw records secured on distributed data servers and only allowing mathematical summaries to be queried from a central service, so that statistical patterns can be discovered across many sites without exposing actual inputs from any one site. Legally defined limited data sets similarly bridge consented data access with managed identity risks for pre-vetted analysts.

Differentially private computations introduce mathematically calibrated noise to guarantee that the presence of, or sensitive attributes tied to, any one patient is masked across many patients. This masking allows research insights to be uncovered without revealing individual contributions. Secure multiparty computation and homomorphic encryption also enable certain restricted computations, like aggregates, means and distributions, to be executed on sensitive inputs while keeping the underlying data encrypted.
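
A minimal sketch of this pattern, with invented site counts: each site contributes only an aggregate, and Laplace noise scaled to the query's sensitivity and a chosen privacy budget masks any single patient's contribution before release.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical federated query: each site reports how many of its patients
# meet some criterion; raw records never leave the sites.
site_counts = np.array([124, 310, 87])

epsilon = 1.0      # privacy budget (smaller = noisier, more private)
sensitivity = 1    # adding or removing one patient changes a count by at most 1

# Laplace mechanism: release the total plus calibrated noise.
noisy_total = site_counts.sum() + rng.laplace(0, sensitivity / epsilon)
print(f"True total: {site_counts.sum()}, DP release: {noisy_total:.1f}")
```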

Such cryptographic methods and privacy-enhancing technologies provide complementary assurances to traditional de-identification practices. But governance, interpretation and usability remain active areas of improvement to fulfill ethical promises in practice. Holistic data safe havens must align emerging privacy-preserving computation capabilities with rigorous curation, context-based de-identification protocols and trust-based oversight mechanisms that can demonstrably justify public interest usages while preventing tangible harms to individuals and communities whose sensitive data fuels research.

Sources:
https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html 

https://www.ncsl.org/research/telecommunications-and-information-technology/hipaa-de-identification-state-laws.aspx 

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234962 

https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative 


Understanding Key Health Data Types: Clinical Trials, Claims, EHRs

By | Clinical Trials, EHR, Health Data Types

Key Takeaways:

  • Key healthcare data types include clinical trials, insurance claims, and electronic health records (EHRs), each with distinct purposes.
  • Clinical trial data directly captures efficacy and safety of interventions, but availability is limited until publication and may lack generalizability.
  • Insurance claims provide large-scale utilization patterns, outcomes metrics across diverse groups, and cost analysis, but lack clinical precision.
  • EHR data offers longitudinal individual patient history and care details in operational workflows but quality and standardization varies.
  • Combining evidence across clinical trials, claims data, and EHRs enables real-world monitoring of interventions to guide optimal decisions and policies.

In an era of big data and analytics-driven healthcare, evidence informing clinical and policy decisions draws from an expanding variety of data sources that capture different aspects of patient care and outcomes. Three vital sources of health data include structured databases tracking results of clinical trials, administrative insurance claims systems, and electronic health records (EHRs) compiled at hospitals and health systems. Each data type serves distinct purposes with inherent strengths and limitations.

This article explains the defining characteristics, appropriate use cases, and limitations of clinical trials, insurance claims data, and EHRs for healthcare and life science researchers, operators, and innovators. Combining complementary dimensions across data types enables robust real-world monitoring of healthcare interventions to guide optimal decisions and policies matched to specific populations.

Clinical Trials

The randomized controlled trial (RCT) serves as the gold standard for evaluating safety and efficacy of diagnostic tests, devices, biologics, and therapeutics prior to regulatory approvals. Clinical trials compare treatments in specific patient groups, following strict protocols and monitoring outcomes over a set study period. Data elements captured include administered treatments, predefined clinical outcomes, patient-reported symptoms, clinician assessments, precision diagnostics, genomic biomarkers, other quantifiable endpoints, and adverse events.

RCT datasets supply the most scientifically valid assessment of efficacy and toxicity for an intervention compared to alternatives like placebos or other drugs, because influential variables are intentionally balanced across study arms using eligibility criteria and random assignment. This internal validity comes at the cost of potentially reduced generalizability and applicability: benefits and risks can be difficult to translate accurately to heterogeneous real-world populations. Published trial findings often overstate effectiveness when applied more broadly. Additional data from pragmatic studies is needed to complement classical efficacy findings along the product lifecycle.

Supplemental data integration is required to expand evidence beyond the limited snapshots of clinical trial participants and into continuous monitoring of outcomes across wider populations who are prescribed the treatments clinically. Here the high-level perspectives of insurance claims data and granular clinical details contained in EHRs play a vital role.

Insurance Claims

Administrative claims systems maintained by public and commercial health insurers serve payment and reimbursement purposes rather than research goals. Yet analysis of population-level claims data containing coded diagnoses, procedures performed, medications dispensed, specialty types, facilities visited, and costs billed and reimbursed yields insights into usage trends, treatment patterns, acute events, and cost efficiency that complement clinical trials.

Claims provide researchers a broad window into diagnoses, prescribed interventions, and health outcomes, frequently spanning millions of covered lives across geographical regions absent from most trials. Claims data encompasses all covered care delivered rather than isolated interventions. Examining trends over longer timeframes, across more diverse patients than strict trial eligibility allows, enables assessment of real-world utilization frequencies, comparative effectiveness versus alternatives, clinical guideline adherence, acute complication rates, mortality metrics, readmission trends, and direct plus indirect medical costs.

However, claims data lacks the precise clinical measures systematically captured in trials and EHR records. Billing codes often fail to specify clinical severity or capture quality of life impacts. Available data elements focus primarily on how much and how often healthcare services are used rather than qualitative clinical details or patient-reported outcomes. Underlying diagnoses and accuracy of coding may require supplementary validation. Despite its limitations, claims data plays a crucial role in providing essential information for healthcare professionals, researchers, and policymakers. It serves as a valuable tool for monitoring diverse aspects of the healthcare system, ultimately contributing to the assurance of efficient, safe, and effective treatments.

While abbreviated claims codes document utilization events at a population level and clinical trials quantify experience for circumscribed groups, the patient-centric Electronic Health Record (EHR) details comprehensive individual-level clinical data as an immutable ledger accumulated over years of clinical encounters across care settings. The longitudinal EHR chronicles detailed diagnoses, signs and symptoms, lab orders and results, exam findings, procedures conducted, prescriptions written, physician notes and orders, referral details, communications around critical results, and other discrete or unstructured elements reflecting patient complexity often excluded from claims data and trials.

EHRs

EHRs provide fine-grained data for precision medicine inquiries into subsets of patients with common clinical trajectories, risk profiles, comorbidities, socioeconomic factors, access challenges, genomic risks, family histories of related illnesses, lifestyle behaviors like smoking, and personalized interventions based on advanced molecular markers. EHR data supports deep phenotyping algorithms and temporal pattern analyses that can extract cohort comparisons not feasible solely from claims.

Secondary use of EHR data faces challenges in representativeness when drawing data from single health systems rather than national networks, variability in coding terminologies and data entry fields across platforms, fragmentation forcing linkage between separate specialties and sites of care, semi-structured formats with mixed discrete codified and free text variables, and data quality gaps during clinician workflow constraints. Population-based claims data ensures inclusion of patients seeking care across all available providers rather than just one health system.

Integrating Complementary Evidence

Definitive clinical trial efficacy remains the gold standard when initially evaluating medical interventions, while large-scale claims data offers a complementary view of broader utilization patterns and comparative outcomes across more diverse populations who are receiving interventions in clinical practice. However, as interventions diffuse beyond the research setting, reliable acquisition of clinical details requires merging population-based signals from claims with deep clinical data contained uniquely within EHRs.

Combining evidence across clinical trials, claims databases, and EHR repositories maximizes the strengths of each data type while overcoming the inherent limitations of any single source. Clinical trials establish efficacy, and combining insights from large-scale claims data with detailed clinical information in EHRs is crucial for assessing interventions as they transition from research to practical healthcare, contributing to overall healthcare improvement.

 

| Aspect | Clinical Trial Data | Claims Data | EHR Data |
|---|---|---|---|
| Primary Purpose | Research and development of new treatments | Billing and reimbursement for services | Patient care and health record keeping |
| Data Source | Controlled clinical studies | Insurance companies, healthcare providers | Healthcare providers |
| Data Types Included | Patient demographics, treatment details, outcomes | Patient demographics, services rendered, cost | Patient demographics, medical history, diagnostics, treatment plans |
| Data Structure | Highly structured and standardized | Structured but varies with payer systems | Structured and unstructured (e.g., doctor’s notes) |
| Temporal Span | Limited to the duration of the trial | Longitudinal, covering the duration of coverage | Longitudinal, covering comprehensive patient history |
| Access and Privacy | Restricted, subject to clinical trial protocols | Restricted, governed by Health Insurance Portability and Accountability Act (HIPAA) regulations | Restricted, governed by HIPAA and patient consent |
| Primary Users | Researchers, pharmaceutical companies | Healthcare providers, payers, policy makers | Healthcare providers, patients |
| Data Volume and Variety | Relatively limited, focused on specific conditions | Large, diverse, covering a wide range of conditions and services | Large, diverse, includes a wide range of medical information |
| Use in Healthcare | Drug development, understanding treatment effectiveness | Healthcare economics, policy making, fraud detection | Direct patient care, diagnosis, treatment planning |
| Challenges | Limited generalizability, high cost | Variability in coding, potential for missing data | Inconsistent data entry, variability in EHR systems |

 

Sources:
https://www.ncbi.nlm.nih.gov/books/NBK11597/ 

https://pubmed.ncbi.nlm.nih.gov/10146871/

https://www.fda.gov/drugs/types-applications/new-drug-application-nda 

https://www.nia.nih.gov/research/blog/2017/06/pragmatic-clinical-trials-testing-treatments-real-world


An Introduction to Polygenic Risk Scores: Aggregating Small Genetic Effects to Stratify Disease Risk

By | Polygenic Risk Scores

Key Takeaways:

  • Polygenic risk scores aggregate the effects of thousands of genetic variants to estimate an individual’s inherited risk for complex diseases.
  • Polygenic risk is based on genome-wide association studies that identify common variants associated with modest increases in disease risk.
  • Polygenic scores provide risk stratification beyond family history, but most disease risk is not yet explained by known variants.
  • Clinical validity and utility of polygenic scores will improve as more disease-associated variants are discovered through large genomic studies.
  • Polygenic risk models may one day guide targeted screening and preventive interventions, but face challenges related to clinical interpretation and implementation.

Introduction to Polygenic Risk Scores

The vast majority of common, chronic diseases do not follow simple Mendelian inheritance patterns, but rather are complex genetic conditions arising from the combined small effects of thousands of genetic variations interacting with lifestyle and environmental factors. Polygenic risk scores aggregate information across an individual’s genome to estimate their inherited susceptibility for developing complex diseases like heart disease, cancer, diabetes and neuropsychiatric disorders.

Polygenic risk scores are constructed using data from genome-wide association studies (GWAS) that scan markers across the genomes of thousands to millions of individuals to identify genetic variants associated with specific disease outcomes. While most disease-associated variants have very small individual effects, the combined effect of thousands of these common, single nucleotide polymorphisms (SNPs) can stratify disease risk in a polygenic model.

Polygenic Scores vs. Single Gene Mutations

In monogenic diseases like cystic fibrosis and Huntington’s disease, mutation of a single gene is necessary and sufficient to cause disease. Genetic testing for causal mutations in specific disease-linked genes provides a clear-cut diagnostic assessment. In contrast, no single gene variant accounts for more than a tiny fraction of risk for complex common diseases. Polygenic risk models aggregate the effects of disease-associated variants across the genome, each imparting a very modest increase or decrease in risk. An individual’s polygenic risk score reflects the cumulative impact of thousands of small risk effects spread across their genome.

While polygenic scores are probabilistic and estimate only inherited genetic susceptibility, monogenic mutations convey deterministic information about disease occurrence. However, for many individuals with elevated polygenic risk scores, modifiable lifestyle and environmental factors may outweigh their inherited predisposition, allowing prevention through early intervention.

GWAS and Polygenic Scores

Human genome-wide association studies utilize DNA microarray ‘chips’ containing hundreds of thousands to millions of SNPs across the genome. Comparing SNP frequencies between thousands of disease cases and controls reveals variants associated with disease diagnosis. Each SNP represents a common genetic variant, typically present in at least 1-5% of the population. Individually, SNP effects on disease risk are very modest, usually less than a 20% increase in relative risk.
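
For intuition about what such a comparison looks like numerically, the toy calculation below derives an odds ratio from invented risk-allele counts in cases and controls; real GWAS analyses use regression models with covariates, but the core contrast between groups is the same.

```python
# Invented allele counts for a single SNP; purely illustrative.
cases_risk, cases_other = 2300, 7700          # risk / non-risk alleles in cases
controls_risk, controls_other = 2050, 7950    # risk / non-risk alleles in controls

case_freq = cases_risk / (cases_risk + cases_other)
control_freq = controls_risk / (controls_risk + controls_other)

# Odds ratio: odds of carrying the risk allele in cases vs controls.
odds_ratio = (cases_risk / cases_other) / (controls_risk / controls_other)

print(f"Risk-allele frequency: cases {case_freq:.3f} vs controls {control_freq:.3f}")
print(f"Odds ratio = {odds_ratio:.2f}")  # about 1.16, the modest effect size typical of GWAS hits
```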

However, by aggregating the effects of disease-associated SNPs, polygenic risk models can categorize individuals along a spectrum of low to high inherited risk. Polygenic scores typically explain 7-12% of disease variance, though up to 25% for some cancers. The more powerful the original GWAS in terms of sample size, the better the polygenic score will be at predicting an individual’s predisposition.

Constructing Polygenic Scores

Various methods exist for constructing polygenic scores after identifying disease-associated SNPs through GWAS. Most commonly, a SNP effect size is multiplied by the number of risk alleles (0, 1 or 2) for that SNP in a given individual. These products are summed across all chosen SNPs to derive an overall polygenic risk score. SNPs strongly associated with disease receive more weight than weakly associated markers.
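
A minimal sketch of that weighted sum, using invented effect sizes and genotype dosages, is shown below: each SNP's effect size is multiplied by the individual's risk-allele count and the products are summed into one score per person.

```python
import numpy as np

# Invented per-allele effect sizes (e.g., log odds ratios) for five SNPs.
effect_sizes = np.array([0.12, 0.05, -0.08, 0.20, 0.03])

# Risk-allele dosages (0, 1 or 2 copies) for two hypothetical individuals.
genotypes = np.array([
    [0, 1, 2, 0, 1],   # individual 1
    [2, 2, 1, 1, 0],   # individual 2
])

# Weighted sum across SNPs: one polygenic score per individual.
prs = genotypes @ effect_sizes
print(prs)   # higher score = greater estimated inherited risk
```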

Rigorous validation in independent sample sets evaluates the predictive performance of polygenic scores. Optimal SNP inclusion thresholds are selected to maximize predictive ability. Polygenic models lose power with too few or too many SNPs included. Ideal thresholds retain SNPs explaining at least 0.01% of disease variance based on GWAS significance levels.

Applications and Limitations

Polygenic risk models are currently most advanced for coronary artery disease, breast and prostate cancer, type 2 diabetes and inflammatory bowel disease. Potential clinical applications include:

  • Risk stratification to guide evidence-based screening recommendations beyond family history.
  • Targeted prevention and lifestyle modification for individuals at elevated genetic risk.
  • Informing reproductive decision-making and genetic counseling based on polygenic risk.
  • Improving disease prediction, subtyping and prognosis when combined with clinical risk factors.

However, limitations and ethical concerns exist around polygenic score implementation:

  • Most heritability remains unexplained. Adding more SNPs only incrementally improves prediction.
  • Polygenic testing may prompt unnecessary interventions if clinical validity and utility are not adequately demonstrated.
  • Possible psychological harm and discrimination from probabilistic genetic risk information.
  • Unequal health benefits if not equitably implemented across populations.

While polygenic scores currently identify individuals with modestly increased or decreased disease risks, their predictive utility is anticipated to grow substantially with million-person biobank efforts and whole-genome sequencing. Harnessing the full spectrum of genomic variation contributing to polygenic inheritance will enable more personalized risk assessment and clinical decision-making for complex chronic diseases.

Sources:

  1. Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018 Sep;19(9):581-590.
  2. Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020 May 6;12(1):44.
  3. Khera AV, Chaffin M, Zekavat SM, et al. Whole-genome sequencing to characterize monogenic and polygenic contributions in patients hospitalized with COVID-19. Nat Commun. 2021 Jan 20;12(1):536.
  4. Torkamani A, Erion G, Wang J, et al. An evaluation of polygenic risk scores for predicting breast cancer. Breast Cancer Res Treat. 2019 Apr;175(2):493-503.
  5. Mars N, Koskela JT, Ripatti P, Kiiskinen T TJ, Havulinna AS, Lindbohm JV, Ahola-Olli A, Kurki M, Karjalainen J, Palta P, FinnGen, Neale B, Daly M, Salomaa V, Palotie A, Collins F, Samani N, Ripatti S. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med. 2020 Nov;26(11):1660-1666.

Clinical Trials vs. Real-World Data: Understanding the Differences and Complementary Roles

By | Clinical Trials

Key Takeaways:

  • Clinical trials are controlled experiments designed to evaluate safety and efficacy of new drugs or devices. Real-world data comes from more diverse, less controlled sources like electronic health records and medical claims.
  • Clinical trials have strict inclusion/exclusion criteria and measure predefined outcomes. Real-world data reflects broader populations with various comorbidities and outcomes.
  • Clinical trials are required for regulatory approval but have limitations like small sample sizes. Real-world evidence can complement trials with larger volumes of data over longer time periods.
  • Real-world data comes from routine clinical practice rather than protocol-driven trials. It provides supplementary information on effectiveness and safety.
  • Limitations of real-world data include lack of randomization, potential biases and confounders. Analytic methods help account for these limitations.
  • Real-world evidence has growing applications in medical product development, post-market surveillance, regulatory decisions and clinical guideline development.

Clinical Trials vs. Real-World Data

Clinical trials are prospective studies that systematically evaluate the safety and efficacy of investigational drugs, devices or treatment strategies in accordance with predefined protocols and statistical analysis plans. They are considered the gold standard for assessing the benefits and risks of medical interventions prior to regulatory approval. In clinical trials, participants are assigned to receive an investigational product or comparator/placebo according to a randomized scheme. These studies are designed to minimize bias and carefully control variables that may affect outcomes. Participants are closely monitored per protocol, and data is collected at prespecified points in time. The resulting evidence from randomized controlled trials serves as the primary basis for regulatory decisions regarding drug and device approvals.

In contrast, real-world data (RWD) refers to data derived from various non-experimental or observational sources that reflect routine clinical practice. Sources of RWD include electronic health records (EHRs), medical claims, registry data and patient-generated data from mobile devices, surveys or wearables. Real-world evidence (RWE) is the clinical evidence generated from aggregation and analysis of RWD. While clinical trials evaluate medical products under ideal, controlled conditions in limited samples of patients, RWD offers information about usage, effectiveness and safety in broader patient populations in real-world settings.

Some key differences between clinical trials and real-world data:

  • Sample Populations – Clinical trials have strict inclusion and exclusion criteria, resulting in homogeneous samples that often underrepresent minorities, elderly, pediatric and complex patient groups. RWD reflects more diverse real-world populations with various comorbidities and concomitant medications.
  • Settings – Clinical trials are conducted at specialized research sites under tightly controlled conditions. RWD comes from routine care settings like hospitals, clinics and pharmacies across diverse geographies and populations.
  • Interventions – Clinical trials administer interventions per protocol. RWD reflects variabilities in real-world treatment patterns and patient adherence.
  • Outcomes – Clinical trials measure prespecified outcomes over limited timeframes. RWD captures broader outcomes like patient-reported outcomes, quality of life, hospitalizations and costs over longer periods in real-world practice.
  • Data Collection – Clinical trials collect data per protocol at predefined assessment points. RWD is collected during routine care and reflected in patient records and claims.
  • Sample Size – Clinical trials often have small sample sizes with a few hundred to several thousand patients. RWD encompasses data from tens or hundreds of thousands of patients.
  • Randomization – Clinical trials use randomization to minimize bias when assigning interventions. RWD studies are observational without the benefits of randomization.

While randomized controlled trials provide high quality evidence for drug/device approvals and clinical recommendations, RWD offers complementary information on effectiveness, safety, prescribing patterns and health outcomes:

  • RWD can provide broader demographic representation for subpopulations underrepresented in trials.
  • RWD can inform on long-term safety, durability of treatment effects and comparative effectiveness between therapies.
  • RWD can provide larger sample sizes to study rare events or outcomes.
  • RWD can reflect real-world utilization rates, switching patterns and adherence to therapies.
  • RWD offers granular data for personalized medicine, risk identification, prediction modeling and tailored interventions.
  • RWD is more timely, cost-effective and scalable than conducting large trials.

However, RWD has inherent limitations compared to clinical trials:

  • Lack of randomization increases potential for bias and confounding.
  • Incomplete data or misclassification errors are common with medical records.
  • Inability to firmly conclude causality due to observational nature.
  • Possible selection biases and variations in care delivery across settings.
  • Inconsistencies in definitions, coding, documentation practices over time and sites.

Analytical methods help account for these limitations when generating real-world evidence from RWD:

  • Advanced analytics like machine learning can identify trends and associations within large RWD.
  • Predictive modeling and simulations can estimate treatment effects.
  • Adjusting for confounders, stratification, patient matching and propensity scoring help reduce biases.
  • Expert review of data and methodology helps ensure reliability.

Applications of RWE are expanding and gaining acceptance from key stakeholders:

  • Supplement clinical trial data for regulatory, coverage and payment decisions around medical products.
  • Post-market surveillance of drug and device safety and utilization in real-world practice.
  • Life cycle evidence generation for new indications, formulations, combination products.
  • Provide inputs into clinical guidelines by professional societies.
  • Risk identification/stratification, predictive modeling and personalized medicine.
  • Value-based contracting between manufacturers and payers.
  • Risk management and safety programs for hospitals and health systems.

In summary, clinical trials provide foundational evidence to introduce new medical products, while RWE offers complementary insights on effectiveness, safety, prescribing patterns and health outcomes at a larger scale across diverse real-world populations. Advanced analytics help derive meaningful RWE from RWD, with growing applications across the healthcare and life science ecosystems. Together, these sources of evidence offer a multifaceted understanding to guide optimal use of medical products and improve patient care.

Sources:

  1. ClinicalTrials.gov. What are the different types of clinical research? https://clinicaltrials.gov/ct2/about-studies/learn#WhatIs
  2. Berger ML, et al. Real-World Evidence: What It Is and What It Can Tell Us According to the ISPOR Real-World Data Task Force. Value Health. 2021 Sep;24(9):1197-1204.
  3. Sherman RE, et al. Real-World Evidence – What Is It and What Can It Tell Us? N Engl J Med. 2016 Dec 8;375(23):2293-2297.
  4. Yu T, et al. Benefits, Limitations, and Misconceptions of Real-World Data Analyses to Evaluate Comparative Effectiveness and Safety of Medical Products. Clin Pharmacol Ther. 2019 Oct;106(4):765-778.
  5. Food and Drug Administration. Real-World Evidence. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence

The Role of Polygenic Risk Scores in Clinical Genomics

By | Clinical Genomics

Introduction

We were promised an end to genetic diseases. All we needed to do was unlock the human genome. Unfortunately, life has a way of being more complicated than we expect. It turned out that many genetic disorders are the result of the interplay between multiple genetic factors. This created a need for improved analytical tools that could interrogate the combined effects of many genetic variants and link them to various diseases. One such technique, the Polygenic Risk Score (PRS), emerged as a powerful tool to quantify the cumulative effects of multiple genetic variants on an individual’s predisposition to a specific disease.

The Evolution of Polygenic Risk Scores

The genesis of PRS can be traced back to the early 2000s when researchers sought to comprehend the collective impact of multiple genetic variants on disease susceptibility. Initially viewed through a biological lens, the focus was on enhancing the prediction of diseases by analyzing subtle genomic variations. Studies concentrated on prevalent yet complex diseases such as diabetes, cardiovascular diseases, and cancer, laying the groundwork for a comprehensive understanding of their genetic architecture.

That was until Dr. Sekar Kathiresan’s group showed that a PRS could be just as clinically useful as a single high-impact variant (Khera et al., 2018). Instead of looking at the percentage of people with a given PRS in each group (with or without a disease), his group highlighted a more striking effect: the difference in risk between the people with the highest and lowest scores. They could then show that there was a large difference in risk between these two edges of the population.

In the initial stages, PRSs consisted of only the most statistically significant variants from genome-wide association studies. Geneticists often simply counted risk alleles without weighting them by the size of their effect on disease risk. Refining these scores led scientists to challenge arbitrary risk cutoffs and advocate for the inclusion of all variants to maximize statistical power (based on the assumption that, on average, variants with no effect are equally likely to appear positively or negatively correlated with the trait). However, the proximity of variants on a chromosome presented another challenge. Variants that lie closer together on a chromosome are less likely to be separated during recombination (linkage disequilibrium). As a result, they can carry the signal of a variant with a true effect, potentially leading to overcounting of that signal.

To deal with this, geneticists used tools that remove signals within a specified block unless their correlation with the strongest signal falls below a threshold. One of the first packages, PRSice (Choi & O’Reilly, 2019), used an approach called pruning and thresholding. Scientists would choose a block size, say 200,000 base pairs, and a program would slide that block along the genome. If there was more than a single signal in a block, the program would remove (or “prune”) all but the strongest signal, unless a variant’s correlation with the strongest signal was below the “threshold”. The result was that in a region with many different variants that affected disease risk, but which were still somewhat correlated, signal could be lost.
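
The sketch below illustrates this pruning logic on a handful of invented SNPs, processing the strongest signals first and dropping nearby variants whose correlation with an already-kept SNP exceeds the threshold. Real tools such as PRSice compute these correlations from reference genotype panels rather than taking them as given, so this is only a simplified illustration of the idea.

```python
# Invented SNPs: (name, position in base pairs, GWAS p-value,
# r^2 with the lead SNP of its region, pretended to be precomputed).
snps = [
    ("rs1", 10_000, 1e-12, 1.00),
    ("rs2", 55_000, 3e-6, 0.85),
    ("rs3", 160_000, 8e-7, 0.10),
    ("rs4", 310_000, 5e-9, 1.00),
]

WINDOW = 200_000       # block size in base pairs
R2_THRESHOLD = 0.2     # variants this correlated with a kept SNP are pruned

kept = []
for snp in sorted(snps, key=lambda s: s[2]):          # strongest (smallest p) first
    name, pos, p, r2 = snp
    nearby_kept = [k for k in kept if abs(k[1] - pos) < WINDOW]
    if not nearby_kept or r2 < R2_THRESHOLD:
        kept.append(snp)   # lead SNP for its region, or only weakly correlated with one
    # otherwise pruned: its signal likely echoes a stronger nearby SNP

print(sorted(k[0] for k in kept))   # -> ['rs1', 'rs3', 'rs4']
```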

Criticism from biostatisticians prompted a shift towards a Bayesian approach, reducing over-counting while better accounting for partially independent signals. Implementation was challenged by the extensive computational resources needed to update the signal at each genetic location based on the linkage disequilibrium of surrounding SNPs. One program, PRS-CS (Ge et al., 2019), implemented a method that could apply changes to a whole linkage block at once, addressing both geneticists’ demand for a system that produces results with the computational tools available and biostatisticians’ demand for accuracy and retained information.

Despite these advancements, accuracy challenges persisted, particularly when applying scoring systems across populations with different genetic ancestries. It turned out linkage disequilibrium was a pervasive problem: its patterns differ between people of different genetic ancestries, and even summary statistics about the patterns themselves, like the average block size, differ. Recognizing the need for improvement, ongoing efforts to refine PRSs aim to address these challenges, paving the way for more accurate and reliable applications. As researchers delve deeper into these complexities, the evolving landscape of PRSs continues to shape the future of clinical research.

Polygenic Risk Scores in Clinical Research Settings

To harness the full potential of PRS in clinical practice, a crucial shift is needed—from population-level insights to personalized predictions for individual patients. This transformation involves converting relative risks, which compare individuals across the PRS spectrum with a baseline group, into absolute risks for the specific disease (Lewis & Vassos, 2020). The current emphasis is on identifying individuals with a high genetic predisposition to disease, forming the foundation for effective risk stratification. This information guides decisions related to participation in screening programs, lifestyle modifications, or preventive treatments when deemed suitable.
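
As a simple illustration of that conversion (with invented numbers, not estimates from any particular study), multiplying a baseline absolute risk by each PRS group's relative risk yields group-specific absolute risks:

```python
# Hypothetical baseline lifetime risk in the reference (average-PRS) group.
baseline_lifetime_risk = 0.08

# Hypothetical relative risks for PRS strata, compared with that baseline.
relative_risk = {"bottom 20% PRS": 0.6, "middle 60% PRS": 1.0, "top 20% PRS": 2.5}

for group, rr in relative_risk.items():
    absolute = baseline_lifetime_risk * rr
    print(f"{group}: absolute lifetime risk of about {absolute:.1%}")
```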

In practical applications, PRS demonstrates promise in patient populations with a high likelihood of disease. Consider a recent study in an East Asian population, where researchers developed a PRS for Coronary Artery Disease (CAD) using 540 genetic variants (Lu et al., 2022). Tested on 41,271 individuals, the top 20% had a three-fold higher risk of CAD compared to the bottom 20%, with lifetime risks of 15.9% and 5.8%, respectively. Adding PRS to clinical risk assessment slightly improved accuracy. Notably, individuals with intermediate clinical risk and high PRS reached risk levels similar to high clinical risk individuals with intermediate PRS, indicating the potential of PRS to refine risk assessment and identify those requiring targeted interventions for CAD.

Another application of PRS lies in improving screening for individuals with major disease risk alleles (Roberts et al., 2023). A recent breast cancer risk assessment study explored pathogenic variants in high- and moderate-risk genes (Gao et al., 2021). Over 95% of BRCA1, BRCA2, and PALB2 carriers had a lifetime breast cancer risk exceeding 20%. Conversely, integrating PRS identified over 30% of CHEK2 carriers and almost half of ATM carriers below the 20% threshold.

This trend extends to other diseases, such as prostate cancer, where a separate investigation focused on men with elevated levels of prostate-specific antigen (PSA) (Shi et al., 2023). Drawing on more than 100 genetic variants linked to increased PSA levels, the researchers built a PRS for benign PSA elevation. Ordinarily, elevated PSA prompts a prostate biopsy to assess potential prostate cancer. By incorporating PRS into the screening process, doctors could account for this natural, genetically driven variation in PSA and prevent unnecessary escalation of clinical care. Together, these studies suggest that integrating PRS into health screening improves accuracy, preventing unnecessary tests and enabling more personalized risk management.
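One simple way to picture such a correction (a sketch of the general idea, not the exact procedure used by Shi et al.) is to divide a man’s measured PSA by the multiplicative effect expected from the PSA-raising variants he carries, so that the screening decision reflects the residual, non-genetic elevation. All numbers below are hypothetical.

```python
# Illustrative sketch of genetically adjusting a PSA measurement.
# Variant effects and thresholds are hypothetical, not the values
# reported by Shi et al. (2023); not for clinical use.
import math

# Per-variant effects on log(PSA) for the alleles this patient carries.
carried_variant_effects = [0.04, 0.07, 0.02, 0.05]   # hypothetical log-scale effects

measured_psa = 4.6                                    # ng/mL
genetic_multiplier = math.exp(sum(carried_variant_effects))
adjusted_psa = measured_psa / genetic_multiplier

biopsy_threshold = 4.0                                # simplified illustrative cutoff
print(f"Measured PSA:  {measured_psa:.2f} ng/mL")
print(f"Adjusted PSA:  {adjusted_psa:.2f} ng/mL")
print("Refer for biopsy" if adjusted_psa > biopsy_threshold else "Continue routine monitoring")
```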

In the realm of pharmacogenetics, efforts to optimize treatment responses continue. While progress has been made in identifying rare high-risk variants linked to adverse drug events, predicting treatment effectiveness remains challenging. The evolving role of PRS in treatment response is particularly evident in statin use for preventing a first coronary event. In a real-world cohort without prior myocardial infarction, statin effectiveness varied with coronary heart disease (CHD) PRS: the largest effect was seen in the high-risk group, an intermediate effect in the intermediate-risk group, and the smallest effect in the low-risk group (Oni-Orisan et al., 2022). Post-hoc analyses like this could inform more targeted enrollment in clinical trials, substantially reducing the number of participants needed to demonstrate efficacy (Fahed et al., 2022).
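The sample-size argument can be made concrete with a standard power calculation. In the sketch below the event rates and the relative risk reduction are hypothetical; the point is only that when a high-PRS subgroup has a higher baseline event rate, fewer participants are needed to detect the same relative benefit.

```python
# Hypothetical power calculation showing why PRS-enriched enrollment can
# shrink a trial. Event rates and relative risk reduction are illustrative.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

def participants_per_arm(control_event_rate, relative_risk_reduction,
                         alpha=0.05, power=0.8):
    treated_event_rate = control_event_rate * (1 - relative_risk_reduction)
    effect = proportion_effectsize(control_event_rate, treated_event_rate)
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, ratio=1.0)

# Unselected population vs. a high-PRS subgroup with a higher baseline event rate.
for label, rate in [("Unselected enrollment", 0.04), ("High-PRS enrollment", 0.08)]:
    n = participants_per_arm(control_event_rate=rate, relative_risk_reduction=0.25)
    print(f"{label}: about {int(round(n))} participants per arm")
```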

Conclusion

As the field of genetics continues to advance, PRSs emerge as a potent tool with the potential to aid clinical research. Validated PRSs show promise in enhancing the design and execution of clinical trials, refining disease screening, and developing personalized treatment strategies to improve the overall health and well-being of patients. However, it’s crucial to acknowledge that the majority of PRS studies heavily rely on biased datasets of European ancestry. To refine and improve PRS, a comprehensive understanding of population genetic traits for people of all backgrounds, such as linkage disequilibrium, is essential. Moving forward, the integration of PRS into clinical applications must prioritize datasets with diverse ancestry to ensure equitable and effective utilization across all patient backgrounds. As research in this field progresses, the incorporation of PRS is poised to become an indispensable tool for expediting the development of safer and more efficacious therapeutics.

References

Choi, S. W., & O’Reilly, P. F. (2019). PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience, 8(7). https://doi.org/10.1093/gigascience/giz082

Fahed, A. C., Philippakis, A. A., & Khera, A. V. (2022). The potential of polygenic scores to improve cost and efficiency of clinical trials. Nature Communications, 13(1), 2922. https://doi.org/10.1038/s41467-022-30675-z

Gao, C., Polley, E. C., Hart, S. N., Huang, H., Hu, C., Gnanaolivu, R., Lilyquist, J., Boddicker, N. J., Na, J., Ambrosone, C. B., Auer, P. L., Bernstein, L., Burnside, E. S., Eliassen, A. H., Gaudet, M. M., Haiman, C., Hunter, D. J., Jacobs, E. J., John, E. M., … Kraft, P. (2021). Risk of Breast Cancer Among Carriers of Pathogenic Variants in Breast Cancer Predisposition Genes Varies by Polygenic Risk Score. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 39(23), 2564–2573. https://doi.org/10.1200/JCO.20.01992

Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications, 10(1), 1776. https://doi.org/10.1038/s41467-019-09718-5

Khera, A. V., Chaffin, M., Aragam, K. G., Haas, M. E., Roselli, C., Choi, S. H., Natarajan, P., Lander, E. S., Lubitz, S. A., Ellinor, P. T., & Kathiresan, S. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50(9), 1219–1224. https://doi.org/10.1038/s41588-018-0183-z

Lewis, C. M., & Vassos, E. (2020). Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1), 44. https://doi.org/10.1186/s13073-020-00742-5

Lu, X., Liu, Z., Cui, Q., Liu, F., Li, J., Niu, X., Shen, C., Hu, D., Huang, K., Chen, J., Xing, X., Zhao, Y., Lu, F., Liu, X., Cao, J., Chen, S., Ma, H., Yu, L., Wu, X., … Gu, D. (2022). A polygenic risk score improves risk stratification of coronary artery disease: a large-scale prospective Chinese cohort study. European Heart Journal, 43(18), 1702–1711. https://doi.org/10.1093/eurheartj/ehac093

Oni-Orisan, A., Haldar, T., Cayabyab, M. A. S., Ranatunga, D. K., Hoffmann, T. J., Iribarren, C., Krauss, R. M., & Risch, N. (2022). Polygenic Risk Score and Statin Relative Risk Reduction for Primary Prevention of Myocardial Infarction in a Real-World Population. Clinical Pharmacology and Therapeutics, 112(5), 1070–1078. https://doi.org/10.1002/cpt.2715

Roberts, E., Howell, S., & Evans, D. G. (2023). Polygenic risk scores and breast cancer risk prediction. Breast (Edinburgh, Scotland), 67, 71–77. https://doi.org/10.1016/j.breast.2023.01.003

Shi, M., Shelley, J. P., Schaffer, K. R., Tosoian, J. J., Bagheri, M., Witte, J. S., Kachuri, L., & Mosley, J. D. (2023). Clinical consequences of a genetic predisposition toward higher benign prostate-specific antigen levels. EBioMedicine, 97, 104838. https://doi.org/10.1016/j.ebiom.2023.104838

drug response data

Bridging the Gap: How AI Companies in the TechBio Space are Revolutionizing Biopharma using Genomics and Drug Response Data

By | AI

Introduction

Innovations in Artificial Intelligence (AI) have propelled pharmaceutical companies to revolutionize their approaches to designing, testing, and bringing precision medicine and healthcare solutions to the market. Two key elements in advancing precision medicine are early disease detection and understanding which patients respond to a drug within distinct populations. By leveraging genomics and clinical notes, AI companies, specifically in the TechBio space, are transforming the way biopharma identifies, understands, and caters to individuals rather than whole populations.

The Challenge: Precision Medicine and Drug Response

Traditional drug development methods often analyze the success of a drug treatment as its effect on a patient population, leading to highly variable outcomes and adverse effects among individual patients. This is despite the fact that for many diseases the underlying mechanisms driving symptoms can be quite different from person to person. This lack of individualization in treatment can hinder therapeutic efficacy at the group level despite effectiveness for certain individuals. If we hope to accelerate drug development and get cures into the hands of patients faster, future research needs intelligent, cost-effective methods to stratify patients based on the contribution of different disease mechanisms and drug processing capabilities. AI companies are helping biopharma address this challenge by systematically incorporating genomics and insights garnered from those individuals’ de-identified clinical charts.

Genomics: The Blueprint of Personalization

The genomic revolution has undoubtedly paved the way for precision medicine in Biopharma. By analyzing an individual’s genetic data, scientists can identify variations that may influence drug metabolism and response. This approach has already proven highly effective, particularly in the case of breast cancer patients. In some instances of breast cancer, there is an overexpression of the HER2/neu protein (Gutierrez & Schiff, 2011). When genomic markers for this overexpression are identified, anti-HER2 antibodies can be incorporated into the treatment regimen, significantly enhancing survival rates. AI companies are at the forefront of continuing this research by utilizing genomics for the creation of genetic sub-groups essential for biomarker discovery and predicting the most effective drug treatments for individual patients (Quazi, 2022).

Disease Detection and Monitoring with AI-Enhanced Biomarker Research

Early detection and monitoring of disease progression are paramount for improving patient survival rates. Traditionally, biomarker research has focused on identifying individual molecules or transcripts that can serve as early indicators of future severe illness. However, the field is evolving beyond the notion of a single-molecule biomarker diagnostic. Instead, it is turning to AI to examine the relationships between molecules and transcripts, offering a more comprehensive approach to identifying the onset of significant diseases (Vazquez-Levin et al., 2023). Over the past decade, cancer research and clinical decision-making have undergone a significant transformation, shifting from qualitative data to a wealth of quantitative digital information.

Universities and clinical institutions globally have contributed a vast trove of biomarkers and imaging data. This extensive dataset encompasses insights from genomics, proteomics, metabolomics, and various omics disciplines, as well as inputs from oncology clinics, epidemiology, and medical imaging. AI, uniquely positioned to integrate this diverse information, holds the potential to spearhead the development of pioneering predictive models for drug responses, paving the way for groundbreaking advancements in disease diagnosis, treatment prediction, and overall decision-making concerning novel therapies. 

With growing collections of data, it is becoming easier to model how a drug will shift an individual’s biology for better or worse. A recent example of this modelling is the Cancer Patient Digital Twin (CPDT) project, in which multimodal temporal data collected from cancer patients are used to build a Digital Twin (a virtual replica of a patient’s biological processes and health status), allowing for in silico experimentation that may guide testing, treatment, or decision points (Stahlberg et al., 2022).

One example is improving the detection of metastatic disease over time from radiology reports. Researchers used Natural Language Processing (NLP) to expose prediction models to patients’ historical reports (Batch et al., 2022). The authors extracted and encoded relevant features from the medical text and used them to develop, train, and validate models. Over 700,000 radiology reports were used for model development to predict the presence of metastatic disease. Results from this study suggest that NLP models can extract cancer progression patterns from multiple consecutive reports and predict the presence of metastatic disease in multiple organs with higher performance than previous analytical techniques. Early knowledge of disease states or disease risk could lead to revised risk-benefit assessments for treatments and testing, potentially influencing patients’ choices. As a result, patients with otherwise comparable profiles may opt for treatments or tests they would not have otherwise considered. Even in cases where we do not have good biomarkers for disease (for example, Alzheimer’s disease, where most of the biomarkers are quite invasive to collect), knowing that a person has a higher disease risk earlier can enable important research that can lead to better biomarkers and, ultimately, better treatments.
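As a greatly simplified illustration of this kind of pipeline (not the authors’ model, which drew on hundreds of thousands of reports and richer features), a free-text classifier over radiology reports might look like the sketch below, with fabricated example reports and labels.

```python
# Toy sketch of an NLP pipeline for flagging metastatic disease in radiology
# reports. The reports and labels are fabricated; the real study
# (Batch et al., 2022) used far larger data and more sophisticated modeling.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "New hepatic lesions concerning for metastatic disease.",
    "Stable postsurgical changes. No evidence of metastasis.",
    "Multiple pulmonary nodules, suspicious for metastases.",
    "Unremarkable study. No suspicious lesions identified.",
]
has_metastasis = [1, 0, 1, 0]

# TF-IDF features over the report text feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reports, has_metastasis)

new_report = "Enlarging liver lesions, findings compatible with metastatic spread."
probability = model.predict_proba([new_report])[0][1]
print(f"Estimated probability of metastatic disease: {probability:.2f}")
```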

AI-Driven Pharmacogenomics: Revolutionizing Precision Medicine and Clinical Trials

While traditional approaches have paved the way for tailored medical treatments, the integration of AI can supercharge these efforts by leveraging an individual’s genetic information. Consider, for instance, warfarin, a widely prescribed anticoagulant. Accurate dosing is critical at the start of warfarin treatment, when the risks of bleeding and clotting complications are highest. Over decades, dose-response models have been developed to better understand how this drug affects the human body (Holford, 1986). To improve warfarin anticoagulation therapy, dosing algorithms have incorporated genetic information that helps explain factors such as warfarin clearance rate, improving dose selection and therapy (Gong et al., 2011).
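At its core, a genotype-informed dosing algorithm is a regression over clinical and genetic covariates. The sketch below uses entirely hypothetical coefficients; it is not the Gong et al. model or any validated algorithm and is not suitable for clinical use, but it shows the shape of such a calculation.

```python
# Hypothetical illustration of a genotype-informed warfarin dose estimate.
# Coefficients are invented for illustration; this is NOT a validated
# dosing algorithm and must not be used clinically.
def estimated_weekly_dose_mg(age_decades, height_cm, weight_kg,
                             cyp2c9_reduced_function_alleles,
                             vkorc1_a_alleles, on_amiodarone):
    dose = 35.0                                    # hypothetical baseline weekly dose
    dose -= 2.0 * age_decades                      # older patients tend to need less
    dose += 0.05 * height_cm + 0.04 * weight_kg    # larger patients tend to need more
    dose -= 4.0 * cyp2c9_reduced_function_alleles  # slower warfarin clearance
    dose -= 6.0 * vkorc1_a_alleles                 # greater warfarin sensitivity
    if on_amiodarone:
        dose -= 5.0                                # interacting drug lowers requirement
    return max(dose, 5.0)

print(estimated_weekly_dose_mg(age_decades=6, height_cm=170, weight_kg=80,
                               cyp2c9_reduced_function_alleles=1,
                               vkorc1_a_alleles=1, on_amiodarone=False))
```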

Now, with the power of AI, researchers can expedite the personalization of treatments for various disorders and medications, similar to what was accomplished with Warfarin but in a fraction of the time. AI algorithms are starting to analyze an individual’s genetic profile to predict their specific responses to various medications. This approach enables healthcare providers to fine-tune treatment plans, taking into account an individual’s unique genetic makeup, thus optimizing the effectiveness of therapies and reducing the potential for adverse effects. The integration of AI not only enhances the precision of pharmacogenomics but also streamlines the process, ultimately leading to safer and more efficient medical care tailored to each patient’s genetic characteristics.

The ultimate aspiration is to develop a sophisticated AI-driven system that can accurately forecast how each individual will react to specific medications, with the potential to bypass the conventional, time-consuming method of starting with the lowest effective dose and incrementally adjusting it upwards. This trial-and-error approach often leads to prolonged periods of uncertainty and potential adverse side effects for patients. Such advancements not only boost the precision of healthcare but also elevate the overall quality of life for patients seeking rapid relief and improved well-being.

Moreover, the integration of AI in pharmacogenomics has the potential to significantly expedite clinical trial programs. By tailoring medication doses to specific genetic backgrounds, AI can aid all three phases of the clinical trial process. This approach not only streamlines the trials but also offers substantial time and cost savings. The ability to tailor treatments to different genetic subgroups makes clinical trials more efficient, bringing new therapies to market faster and ultimately benefiting patients in need.

Conclusion

The union of genomics and clinical notes, facilitated by AI, is ushering in a new era of precision medicine in biopharma. With the ability to predict individual drug responses and identify targeted therapies, this approach holds immense promise for improved treatment outcomes and a patient-centric view of medicine. As AI companies continue to advance their capabilities, the future of precision medicine for many diseases is looking closer than ever. The key to unlocking its full potential lies in the availability of high-quality data that comprehensively spans the entire patient journey. The integration of such diverse health-related data is central to driving valuable insights for drug development, making AI a driving force in the future of healthcare.

 

Citations:

Batch, K. E., Yue, J., Darcovich, A., Lupton, K., Liu, C. C., Woodlock, D. P., El Amine, M. A. K., Causa-Andrieu, P. I., Gazit, L., Nguyen, G. H., Zulkernine, F., Do, R. K. G., & Simpson, A. L. (2022). Developing a Cancer Digital Twin: Supervised Metastases Detection From Consecutive Structured Radiology Reports. Frontiers in Artificial Intelligence, 5. https://doi.org/10.3389/frai.2022.826402

Gong, I. Y., Schwarz, U. I., Crown, N., Dresser, G. K., Lazo-Langner, A., Zou, G., Roden, D. M., Stein, C. M., Rodger, M., Wells, P. S., Kim, R. B., & Tirona, R. G. (2011). Clinical and genetic determinants of warfarin pharmacokinetics and pharmacodynamics during treatment initiation. PloS One, 6(11), e27808. https://doi.org/10.1371/journal.pone.0027808

Gutierrez, C., & Schiff, R. (2011). HER2: biology, detection, and clinical implications. Archives of Pathology & Laboratory Medicine, 135(1), 55–62. https://doi.org/10.5858/2010-0454-RAR.1

Holford, N. H. (1986). Clinical pharmacokinetics and pharmacodynamics of warfarin. Understanding the dose-effect relationship. Clinical Pharmacokinetics, 11(6), 483–504. https://doi.org/10.2165/00003088-198611060-00005

Quazi, S. (2022). Artificial intelligence and machine learning in precision and genomic medicine. Medical Oncology (Northwood, London, England), 39(8), 120. https://doi.org/10.1007/s12032-022-01711-1

Stahlberg, E. A., Abdel-Rahman, M., Aguilar, B., Asadpoure, A., Beckman, R. A., Borkon, L. L., Bryan, J. N., Cebulla, C. M., Chang, Y. H., Chatterjee, A., Deng, J., Dolatshahi, S., Gevaert, O., Greenspan, E. J., Hao, W., Hernandez-Boussard, T., Jackson, P. R., Kuijjer, M., Lee, A., … Zervantonakis, I. (2022). Exploring approaches for predictive cancer patient digital twins: Opportunities for collaboration and innovation. Frontiers in Digital Health, 4, 1007784. https://doi.org/10.3389/fdgth.2022.1007784

Vazquez-Levin, M. H., Reventos, J., & Zaki, G. (2023). Editorial: Artificial intelligence: A step forward in biomarker discovery and integration towards improved cancer diagnosis and treatment. Frontiers in Oncology, 13, 1161118. https://doi.org/10.3389/fonc.2023.1161118

EHR Data

Claims Data vs EHRs: Distinct but United in Real-World Research

By | EHR

In healthcare research, real-world data from patients is invaluable for gaining insights into disease patterns, treatment effectiveness, and outcomes. Two major sources of real-world data are claims data and electronic health records (EHRs). Both data types have distinct advantages and limitations that impact their utility for different research applications. This article examines the key differences between claims data and EHR data and how researchers can leverage both data sources to answer critical healthcare questions.

What is Claims Data?

Health insurance claims data contains information submitted by healthcare providers to payers to receive reimbursement for services rendered to patients. It includes patient demographics (such as age, gender, location, and insurance details) along with diagnosis codes, procedure codes, prescription details, costs, and reimbursement information. Claims data provides a longitudinal view of a patient’s interactions across healthcare systems, as it captures data each time a claim is filed over months or years of care (Pivovarov et al., 2019).

Large claims clearinghouses aggregate data from millions of patients across different payers, providing massive real-world datasets. For example, IBM MarketScan databases contain claims information on over 240 million US patients collected from employers, health plans and government health programs (IBM Watson Health, 2022). Other major claims aggregators include Optum Clinformatics, Premier Healthcare Database and PharMetrics Plus.

Key Details Captured in Claims Data

  • Patient demographics – age, gender, location, insurance details
  • Diagnoses – ICD diagnosis codes
  • Procedures – CPT and HCPCS procedure codes
  • Medications – NDC codes, dose, number of prescription refills
  • Costs – total and itemized costs, amount paid by payer and patient responsibility

Claims data is extremely valuable for comparative effectiveness research, pharmacoepidemiology, health economics and outcomes research (Berger et al., 2017). The large sample sizes and longitudinal view make claims databases ideal for studying disease incidence, treatment patterns, medication adherence, healthcare costs and utilization across different patient demographics and therapeutic areas.
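To make the structure of a single claim line concrete, the hypothetical record below captures the kinds of fields listed above; the field names and codes are illustrative rather than those of any specific claims database.

```python
# Hypothetical representation of a single claim line; field names and codes
# are illustrative, not drawn from any particular claims database.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ClaimLine:
    member_id: str
    service_date: date
    icd10_diagnosis_codes: list       # e.g., ["E11.9"] for type 2 diabetes
    cpt_procedure_codes: list         # e.g., ["99213"] for an office visit
    ndc_drug_code: Optional[str]      # filled for pharmacy claims
    billed_amount: float
    paid_by_payer: float
    patient_responsibility: float

claim = ClaimLine(
    member_id="M000123",
    service_date=date(2023, 5, 14),
    icd10_diagnosis_codes=["E11.9"],
    cpt_procedure_codes=["99213"],
    ndc_drug_code=None,
    billed_amount=185.00,
    paid_by_payer=132.50,
    patient_responsibility=25.00,
)
print(claim)
```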

Limitations of Claims Data

While claims data offers unparalleled scale and longitudinal perspective, researchers must be aware of its limitations:

  • Diagnosis codes do not necessarily indicate a confirmed diagnosis; for example, a provider might submit a claim with a diagnosis code that is merely being considered during a diagnostic workup.
  • Diagnosis and procedure codes may be inaccurate or incomplete if providers submit improper codes, and important clinical details are missing.
  • Prescription records lack information about whether the medication was taken as directed or refilled properly.
  • Available data elements are restricted to what is required for reimbursement. No additional clinical context is provided.
  • Inability to link family members or track individuals who change payers over time.
  • Variable data quality and completeness across different claims sources.
  • Biased sampling based on the characteristics of a specific payer population, which may not represent the general population.

Despite these limitations, claims data remains highly useful for epidemiologic studies, health economics research, population health analyses and other applications where large sample sizes are critical. Researchers should account for the nuances of claims data during study design and analysis.

What are Electronic Health Records?

Electronic health records (EHRs) are the digital documentation of patient health information generated throughout clinical care. EHRs are maintained by healthcare organizations and contain various data elements documenting patient encounters, including (Hersh et al., 2013):

  • Demographics – age, gender, race, ethnicity, language
  • Medical history – conditions, diagnoses, allergies, immunizations, procedures
  • Medications – prescriptions, dosing instructions
  • Vital signs – blood pressure, heart rate, weight, height
  • Lab test results
  • Radiology images
  • Clinical notes – physician progress notes, discharge summaries

A key advantage of EHR data is its rich clinical context. While claims data only captures billing codes, EHRs include detailed narratives, quantitative measures, images, and comprehensive documentation of each patient visit. This facilitates a better understanding of disease presentation and progression, treatment rationale and response, and patient complexity.

EHR databases aggregate records across large healthcare networks to compile real-world data on millions of patients. For instance, Vanderbilt University Medical Center’s Synthetic Derivative database contains de-identified medical records for over 3.7 million subjects, and its BioVU® database contains over 310,000 DNA samples linked to de-identified medical records for genomics research (Roden et al., 2008).


Benefits of EHR Data

EHR data enables researchers to (Cowie et al., 2017):

  • Obtain granular clinical details beyond billing codes
  • Review physician notes and narratives for patient context
  • Link lab results, pathology reports, radiology images for additional phenotyping
  • Study unstructured data through natural language processing
  • Identify patient cohorts based on complex inclusion/exclusion criteria
  • Examine longitudinal disease patterns and treatment journeys

EHR data yields insights unattainable through claims data alone. The rich clinical details enable researchers to understand nuances in patient populations, disease manifestation and therapy response.

Challenges with EHR Data

While valued for its clinical context, EHR data also has some inherent limitations:

  • Incomplete or missing records if providers fail to properly document encounters
  • Incomplete records if patient receives care at multiple, unlinked healthcare networks
  • Inconsistent use of structured fields vs free text notes across systems
  • Lack of national standards in data formats, terminologies and definitions
  • Biased datasets dependent on specific health system patient population
  • Difficulty normalizing data across disparate EHR systems
  • Requires data science skills to analyze unstructured notes and documents
  • Requires clinical background to appropriately interpret unstructured notes and documents
  • More resource intensive for data extraction and processing compared to claims data

EHR data analysis requires specialized skills and infrastructure, especially to interpret unstructured data. Despite limitations, EHRs remain an invaluable data source on their own or as complements to other data sources like claims for comprehensive real-world evidence generation.

Integrating Claims and EHR Datasets

Given the complementary strengths of claims data and EHRs, there is significant value in integrating these datasets to conduct robust real-world studies. This can be accomplished by (Maro et al., 2019):

  • Linking claims and EHR data at the patient level via unique identifiers
  • Building cohorts based on diagnosis codes from claims data, then reviewing clinical data for each patient in the EHR
  • Using natural language processing on EHR notes to extract additional details not available in claims
  • Applying claims analysis algorithms on EHR data to identify lapses in care, adverse events, etc.
  • Incorporating prescription fills from claims with medication orders in EHRs to assess adherence
  • Using cost data from claims combined with clinical data for health economic studies

Major research networks like PCORnet have developed infrastructure to integrate claims and EHR data to support large-scale patient-centered outcomes research. When thoughtfully combined, these complementary data sources enable multifaceted real-world studies not possible using either source alone.
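As a simplified sketch of one of these patterns, the hypothetical example below links pharmacy claims to EHR medication orders on a shared patient identifier and computes a crude proportion-of-days-covered adherence estimate; the table and column names are invented.

```python
# Hypothetical sketch of linking EHR medication orders with pharmacy claims
# to estimate adherence. Table and column names are invented for illustration.
import pandas as pd

ehr_orders = pd.DataFrame({
    "patient_id": ["P1", "P2"],
    "drug": ["atorvastatin", "metformin"],
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-10"]),
})

pharmacy_claims = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "drug": ["atorvastatin", "atorvastatin", "metformin"],
    "fill_date": pd.to_datetime(["2023-01-06", "2023-02-20", "2023-02-12"]),
    "days_supply": [30, 30, 30],
})

# Link the two sources on patient and drug, keeping fills after the order date.
linked = ehr_orders.merge(pharmacy_claims, on=["patient_id", "drug"])
linked = linked[linked["fill_date"] >= linked["order_date"]]

# Crude proportion of days covered over the first 90 days after the order.
pdc = (
    linked.groupby(["patient_id", "drug"])["days_supply"].sum().clip(upper=90) / 90
).rename("proportion_of_days_covered")
print(pdc)
```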

Claims data and EHRs both provide invaluable real-world evidence on patient populations, but have distinct strengths and limitations. Claims data allows longitudinal analysis of diagnosis, procedure and prescription patterns at scale, but lacks clinical granularity. EHRs provide rich clinical context like physician notes, lab results and images, but lack continuity across health systems and data standardization. By integrating these sources, researchers can conduct robust real-world studies leveraging the advantages of both datasets. Careful consideration of the nuances of each data type allows generation of comprehensive real-world evidence to inform healthcare decisions and improve patient outcomes.

At NashBio, we use EHR data for most of our analytic activities because of its depth and additional clinical context, which helps us build the highest fidelity study populations for our clients.

References

Berger, M. L., Sox, H., Willke, R. J., Brixner, D. L., Eichler, H.-G., Goettsch, W., … Schneeweiss, S. (2017). Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety, 26(9), 1033–1039. https://doi.org/10.1002/pds.4297

Cowie, M. R., Blomster, J. I., Curtis, L. H., Duclaux, S., Ford, I., Fritz, F., … Zalewski, A. (2017). Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106(1), 1–9. https://doi.org/10.1007/s00392-016-1025-6

Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R. O., Bernstam, E. V., Lehmann, H. P., Hripcsak, G., Hartzog, T. H., Cimino, J. J., & Saltz, J. H. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical care, 51(8 0 3), S30–S37. https://doi.org/10.1097/MLR.0b013e31829b1dbd

IBM Watson Health. (2022). IBM MarketScan Databases. https://www.ibm.com/products/marketscan-research-databases

Maro, J. C., Platt, R., Holmes, J. H., Stang, P. E., Steiner, J. F., & Douglas, M. P. (2019). Design of a study to evaluate the comparative effectiveness of analytical methods to identify patients with irritable bowel syndrome using administrative claims data linked to electronic medical records. Pharmacoepidemiology and Drug Safety, 28(2), 149–157. https://doi.org/10.1002/pds.4698

Pivovarov, R., Albers, D. J., Hripcsak, G., Sepulveda, J. L., & Elhadad, N. (2019). Temporal trends of hemoglobin A1c testing. Journal of the American Medical Informatics Association, 26(1), 41–48. https://doi.org/10.1093/jamia/ocy137

Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clinical pharmacology and therapeutics, 84(3), 362–369. https://doi.org/10.1038/clpt.2008.89