Crafting Compelling Stories from Clinical Trial Data: Leveraging Real-World Insights

By Clinical Trials

Key Takeaways:


  • Integrating elements of storytelling enriches the presentation of clinical trial data, making it more engaging and informative.
  • Analyzing raw clinical trial data reveals hidden trends and patterns upon which to build a compelling story.
  • Adhering to ethical and regulatory standards is imperative when crafting narratives from clinical trial findings.
  • Augmenting clinical trial data with real-world outcomes allows for a more comprehensive understanding of treatment effectiveness, representing diverse demographics and lifestyles often underrepresented in the controlled clinical trial environment.
  • Communicating the real-world impact of treatments beyond statistical outcomes is crucial for showcasing scientific advancement to a broad audience.


In pharmaceutical research and development, the path from clinical trials to market is strictly defined. It involves analyzing data, conducting research, and ensuring compliance with regulations. It is essential to follow these steps to ensure the safety and efficacy of the drug. Although clinical trial data are often presented scientifically, they contain the potential for powerful storytelling beyond statistical tables and regulatory filings.



In this blog post, we explore the art of storytelling with clinical trial data and discuss how real-world perspectives can augment these insights.


Structuring the Story

Great storytelling hinges on certain fundamental elements. Below are a few of these elements and ways each can be applied to clinical trial data to help inform and engage the audience.


  • Character. In clinical trials, “character” refers to the intervention under study. Highlight the features, benefits, and potential impacts of the intervention.
  • Plot. Create a clear storyline. Start with the problem (condition or disease), introduce an intervention (treatment), describe the process (methodology), and conclude with the results.
  • Setting. Provide context by explaining the background and significance of the studied condition and why the clinical trial is crucial for advancing treatment.
  • Conflict. Discuss challenges faced during the clinical trial, such as recruitment difficulties or unexpected side effects.
  • Resolution. Share the clinical trial outcomes. Present statistical results, efficacy, safety, and any breakthrough findings.
  • Visual aids. Incorporate visuals such as infographics, charts, and interactive dashboards to make complex information more accessible and engaging.


Uncovering the Story

Interpreting clinical trial results can be challenging. To ensure findings are accessible to a broad audience, it is important to construct a compelling narrative around the data.


Clinical trial data serve as the backbone of drug development. In raw form, these data include patient demographics, treatment protocols, treatment responses, adverse events, and efficacy outcomes presented as numbers, tables, and graphs. However, analyses are often required to help uncover the stories within clinical trial data.


Analyzing clinical trial data reveals trends and patterns that are not immediately apparent. Whether it's the impact of a specific treatment on a particular subgroup or long-term effects beyond the clinical trial period, these insights contribute to a richer and more comprehensive storyline.
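As a minimal sketch of what this kind of subgroup analysis looks like in practice, the snippet below groups toy trial records by subgroup and arm and computes response rates. The records, field names, and values are entirely illustrative, not drawn from any real study:

```python
from collections import defaultdict

# Hypothetical per-patient records; field names and values are illustrative only.
records = [
    {"age_group": "<65", "arm": "treatment", "responded": 1},
    {"age_group": "<65", "arm": "placebo",   "responded": 0},
    {"age_group": "65+", "arm": "treatment", "responded": 0},
    {"age_group": "65+", "arm": "placebo",   "responded": 0},
    {"age_group": "<65", "arm": "treatment", "responded": 1},
    {"age_group": "65+", "arm": "treatment", "responded": 1},
]

# Group outcomes by (subgroup, arm) and compute response rates --
# a first step toward spotting subgroup effects hidden in aggregate results.
groups = defaultdict(list)
for r in records:
    groups[(r["age_group"], r["arm"])].append(r["responded"])

rates = {key: sum(v) / len(v) for key, v in groups.items()}
print(rates)
```

Even in this toy example, the split view surfaces a pattern (stronger response under 65) that a single pooled response rate would hide.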


Beyond numerical outcomes and statistical significance, the impact of treatment extends to its ability to address unmet medical needs and improve patient outcomes. A balanced presentation of risks and benefits empowers healthcare professionals, policymakers, and patients to understand the data’s implications and make conscientious decisions.


Maintaining ethical standards and transparency is crucial to this process. Adhering to regulatory guidelines and clearly articulating limitations and biases ensures integrity in storytelling with clinical trial data.


Supporting the Story

Clinical trials are only one phase of the drug development lifecycle. Understanding the long-term safety and effectiveness of a treatment often requires evaluation beyond the controlled clinical trial environment.


For example, supplementing clinical trial data with real-world outcomes helps validate hypotheses, identify potential biomarkers, and uncover post-marketing insights. This approach accommodates variations in demographics, socioeconomics, and lifestyle choices often excluded from controlled clinical trials.


In addition, communicating how a treatment translates into improved quality of life or reduced healthcare burden further solidifies the narrative.



Clinical trial data are critical for advancing healthcare but are often challenging to interpret. Transforming clinical trial data into compelling stories involves focusing on the character, plot, setting, conflict, and resolution of the study while enriching the data with real-world perspectives. Crafting narratives that resonate with diverse stakeholders is crucial for conveying the true impact of treatments, driving research, and advancing healthcare for the benefit of everyone.


From Genes to Drugs: The Role of Genetics in Modern R&D

By Clinical Genomics

Key Takeaways:


  • Human genetics research can elucidate mechanisms of disease and help identify new drug targets.
  • Studying genetic variants linked to disease risk or drug response helps stratify patients and inform clinical trials.
  • Genomic data enables the development of precision medicines targeted to patients’ genetic profiles.
  • Pharmacogenomics and genetic screening guide optimal drug usage and minimize adverse reactions.
  • Advancements in genetic analysis technologies are enabling more rapid and expansive use of genomic data in drug R&D.


The Value of Human Genetics in Drug R&D

Developing new drugs is a lengthy and expensive process with a high failure rate. On average, it takes 10-15 years and over $1 billion to bring a new drug to market. The pharmaceutical industry is looking to human genetics research to improve R&D efficiency, success rates and the personalized utility of new medicines.


Understanding the genetic factors underlying diseases can point the way to new drug targets. Identifying genetic variants linked to disease risk helps elucidate biological pathways involved. Druggable targets can then be identified to modulate relevant pathways and processes. Genetics also helps establish causal mechanisms to avoid spurious associations.


Pharmacogenomics focuses on how genetic variability affects drug response. It enables matching patients to treatments according to genotype to maximize effectiveness and avoid adverse reactions. Testing for pharmacogenomic biomarkers can guide dosing, or indicate alternate treatments when genetics point to likely non-response.


Genetic screening also aids patient stratification and clinical trial optimization. Enriching trial participant selection for those most likely to respond or exhibit a clinical effect improves statistical power with smaller sample sizes. Genetic variables allow better control for confounding factors. Pharmacogenomic testing of participants also helps explain differential responses.


Studying rare genetic variants with large effects (“genetic supermodels”) provides another window into disease biology. The study of extreme genotypes helps unravel mechanisms and identify new targets.


Once a drug is developed, genetics continues to inform optimal use. Screening programs using pharmacogenomic biomarkers guide treatment choices and minimize risks. Genetics also aids mechanistic understanding of how therapies work, illuminating additional applications and opportunities.


The plummeting costs of genome sequencing and advances in big data analytics are enabling more extensive use of human genetic data. Pete Hulick, lead for molecular biology at Eli Lilly, described human genetics as “intersecting with everything that we do” in drug R&D.


Applications in Discovery Research

Early in the R&D process, human genetic insights can point the way to promising disease targets. Scientists look for associations between genetic variants, such as single nucleotide polymorphisms (SNPs), and disease risk. Genome-wide association studies (GWAS) uncover SNP differences between disease and control cohorts. Significant associations indicate genes and biological pathways involved in the disease that may be amenable to pharmacological intervention.
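To make the GWAS idea concrete, here is a toy single-SNP association test: a 2x2 table of allele counts in cases versus controls, scored with a Pearson chi-square statistic. The counts are invented for illustration, and a real GWAS repeats a test like this at millions of SNPs with a multiple-testing-corrected significance threshold (conventionally p < 5e-8):

```python
# Toy allele-count contingency table for one SNP (illustrative numbers only):
# rows = cases/controls, columns = risk allele (A) / other allele (G).
cases    = {"A": 240, "G": 160}   # 400 alleles from the disease cohort
controls = {"A": 180, "G": 220}   # 400 alleles from the control cohort

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

stat = chi_square_2x2(cases["A"], cases["G"], controls["A"], controls["G"])
# A large statistic suggests the allele frequency differs between cohorts,
# flagging the SNP's gene and pathway for follow-up lab work.
print(f"chi-square = {stat:.2f}")
```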

Once potential targets are identified, downstream lab research explores how to modulate them. Developing a drug is an iterative process, but human genetics provides clues on where to start.

Genetics also offers validation when biological hypotheses emerge from other experiments. Confirming that tweaking a gene or pathway affects disease risk strengthens the case for pursuing it as a drug target.


Patient Stratification & Clinical Trials

Patient heterogeneity is a major obstacle in clinical trials. Varied treatment responses lower statistical power and necessitate larger trial sizes. Genetic analysis enables better patient stratification to minimize heterogeneity and identify relevant subgroups.
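A back-of-envelope calculation shows why enrichment helps. For a two-arm comparison of means, required sample size scales roughly with one over the squared effect size; if only a fraction of unselected patients carries the responsive genotype, the observed average effect is diluted by that fraction. The numbers below are illustrative, and real power calculations also account for variance, alpha, and beta:

```python
# Back-of-envelope: required n is proportional to 1 / effect_size**2.
# If only 40% of an unselected population carries the responsive genotype,
# the observed average effect is diluted to 40% of the true carrier effect,
# so the unstratified trial needs roughly (1 / 0.4)**2 = 6.25x the participants.

def relative_sample_size(responder_fraction):
    # Diluted effect = responder_fraction * true effect; n scales as 1/effect^2.
    return (1 / responder_fraction) ** 2

print(relative_sample_size(0.4))  # unstratified trial vs. genotype-enriched trial
```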


For example, the cystic fibrosis drug Kalydeco works for patients with a particular CFTR gene mutation. Prescreening patients’ genetics enables targeted trial recruitment. Similar approaches minimize heterogeneity in cancer trials by selecting patients with tumors exhibiting specific mutations.


Genotyping trial participants helps explain differential responses and may uncover additional genotype-specific effects. Genetic associations can also point to new indications for the drug mechanism.


Precision Medicine

The emergence of targeted precision therapies relies directly on human genetics. Cancer treatments like Herceptin and Gleevec target tumors with specific genomic variants. HIV drugs are tailored to individual viral genotypes. Gene therapies introduce corrected genes to compensate for defective inherited genes.


This personalized approach promises greater efficacy for those most likely to respond. By targeting drugs based on genetic profiles, precision medicine seeks to maximize benefit while minimizing unnecessary treatment.


Pharmacogenomics for Safety & Optimization

Pharmacogenomic testing assesses how genetic variability affects reactions to drugs. It can identify patients likely to experience adverse events or suboptimal responses. This enables selecting safer treatments, dosage adjustment or more intense monitoring.


The blood thinner warfarin, for example, demonstrates significant pharmacogenomic effects. Genotyping helps guide ideal dosing to balance effectiveness and bleeding risks. The FDA added pharmacogenomic guidance on warfarin labeling in 2007.
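In practice, genotype-guided dosing often reduces to a lookup from genotype combination to dose category. The sketch below mirrors the CYP2C9/VKORC1 markers discussed on the warfarin label, but the categories and mappings are placeholders for illustration only, not clinical guidance:

```python
# Minimal sketch of genotype-guided dose selection. Genotype labels reference
# the CYP2C9/VKORC1 markers on the warfarin label, but the dose categories and
# mappings here are PLACEHOLDERS for illustration, not clinical recommendations.
DOSE_TABLE = {
    ("CYP2C9 *1/*1", "VKORC1 GG"): "standard",
    ("CYP2C9 *1/*1", "VKORC1 AA"): "reduced",
    ("CYP2C9 *1/*3", "VKORC1 GG"): "reduced",
    ("CYP2C9 *1/*3", "VKORC1 AA"): "greatly reduced",
}

def dose_category(cyp2c9, vkorc1):
    # Genotype combinations outside the table fall back to clinical judgment.
    return DOSE_TABLE.get((cyp2c9, vkorc1), "clinical judgment")

print(dose_category("CYP2C9 *1/*3", "VKORC1 AA"))
```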


Wider adoption of pharmacogenomic testing has the potential to reduce adverse drug events that represent a significant public health burden. More optimal treatment through genetic guidance also contributes to pharmacoeconomic goals.


Looking Ahead

The expanding use of human genetics is transforming every phase of drug R&D. While challenges remain in interpreting and applying genetic findings, the value in accelerating discovery, precision medicine and optimized therapeutics is evident. Advances in high-throughput genomics, big data analytics and machine learning will further incorporate human genetics into tomorrow’s medicines.




The Basics of OMOP – Data Standardization in Healthcare

By Health Data Types, Healthcare Data

Key Takeaways:

  • OMOP Defined: The Observational Medical Outcomes Partnership (OMOP) is a common data model for organizing healthcare data from various sources.
  • Objective: OMOP aims to standardize and integrate diverse healthcare data, facilitating analysis and research.
  • Data Structuring: It organizes data into standard tables and fields (observations, procedures, drug exposures, conditions, etc.), enhancing analytics across datasets.
  • Enhanced Analysis: A common data model allows for larger data pooling, increasing statistical power in analysis.
  • Privacy Protection: OMOP prioritizes patient privacy, using de-identified data while retaining analytical utility.


What is OMOP Data?

OMOP represents a collaborative effort to standardize the transformation and analysis of healthcare data from diverse sources. Its goal is to optimize observational data for comparative research and analytics. The OMOP Common Data Model (CDM) prescribes a structured format for organizing heterogeneous healthcare data, encompassing demographics, encounters, procedures and more. This facilitates cross-platform analytics and queries. Notably, OMOP is a blueprint for data organization, not a database. It supports data standardization across platforms, leading to more robust datasets.
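To make the person-centric structure concrete, here is a heavily simplified sketch of two CDM-style tables and a join query, using Python's built-in sqlite3. The real specification defines many more tables and columns, and the concept IDs below are illustrative stand-ins for standardized vocabulary codes:

```python
import sqlite3

# Drastically simplified sketch of two OMOP CDM tables; the actual model
# (see the OHDSI CDM specification) has many more tables and columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER,
    gender_concept_id INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
""")
# Illustrative rows; the concept IDs stand in for standardized vocabulary codes.
conn.execute("INSERT INTO person VALUES (1, 1980, 8532)")
conn.execute("INSERT INTO condition_occurrence VALUES (10, 1, 201826, '2021-03-02')")

# Person-centric query: every condition record links back to an individual,
# which is what enables longitudinal analysis across source systems.
rows = conn.execute("""
    SELECT p.person_id, c.condition_concept_id, c.condition_start_date
    FROM person p JOIN condition_occurrence c ON p.person_id = c.person_id
""").fetchall()
print(rows)
```

Because every site maps its source data into this same shape, the same query runs unchanged against any OMOP-conformant dataset.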


Key Features of OMOP:

  • Vocabulary Standards: For coding concepts like conditions and medications.
  • Standard Formats: For dates, codes and relational data structures.
  • Person-Centric Model: Data connected to individuals over time.
  • Support for Various Data Types: Like EHR, claims, registries, etc.
  • Open Source Licensing: Promotes free implementation and continuous evolution of the standard.


OMOP’s standardization ensures key clinical concepts are represented uniformly, balancing analytical utility with patient privacy.


Use of OMOP Data:

OMOP facilitates practical medical research by standardizing observational data, enabling:


  • Cross-platform analytics on combined datasets.
  • Reproduction of analyses and sharing of methods.
  • Application of predictive models across diverse data types.
  • Support for safety surveillance and pharmacovigilance.
  • Conducting population health studies and comparative effectiveness research.


Implemented by a variety of organizations, OMOP enables significant analytical use cases, including drug safety signal detection, real-world treatment outcome analysis and population health forecasting. By creating a common language for healthcare data, OMOP fosters data integration and analysis on a larger scale, accelerating health research.




Human Genetics as a Strategic Imperative to Accelerate Drug Discovery: The Alliance for Genomic Discovery

By Clinical Genomics

Key Takeaways:

  • Pharmaceutical development is high-risk and resource-intensive, with a 90% failure rate in clinical trials, often due to inadequate efficacy, toxicity, drug properties, or commercial viability.
  • Incorporating human genetic evidence doubles drug approval rates, paving the way for innovative therapies and new molecular entities.
  • Techniques like GWASs and PheWAS linking genetic data to phenotypic data enhance drug development by identifying associations between rare alleles and diseases.
  • Published human genetic studies, primarily centered on individuals of European descent, hinder our understanding of genetic diversity and impede the development of new therapies suitable for diverse populations; therefore, establishing study cohorts with under-represented populations is crucial for promoting health equality and identifying novel drug targets based on diverse genetic variants.
  • The Alliance for Genomic Discovery (AGD) aims to reshape drug development by sequencing 250,000 diverse samples, providing a powerful resource for pharmaceutical members to correlate genetic variations with clinical outcomes and, in turn, enabling these companies to better serve a global population.


The Struggle to Discover New Therapies

Discovering and developing pharmaceuticals is a resource-intensive and high-risk endeavor, sometimes spanning 15 years, with costs exceeding $2 billion per approved drug (Hinkson et al., 2020). Shockingly, about nine out of ten potential therapies that progress to clinical trials fail before approval (Dowden & Munro, 2019; Sun et al., 2022). The four primary contributors to the staggering 90% failure rate in drug development are inadequate clinical efficacy, unmanageable toxicity, suboptimal drug-like properties and a lack of commercial viability (Dowden & Munro, 2019; Harrison, 2016; Sun et al., 2022). To increase the chances of a drug target passing these critical checkpoints, considerable effort can be directed towards incorporating human genetic evidence into drug development.


In the drug development pipeline, all compounds must undergo rigorous testing in animal models before entering clinical phases, providing evidence of their potential to treat disease. However, despite promising results in preclinical studies, the translation of efficacy and safety from animal models to human clinical trials is often elusive. Integrating human genetic evidence into the drug development process has recently emerged as a crucial strategy to navigate this challenge. Drugs grounded in such evidence exhibit a twofold increase in approval rates (Nelson et al., 2015), contributing to a higher prevalence of first-in-class therapies and new molecular entities (NMEs) (King et al., 2019). This not only accelerates the approval process but also streamlines the discovery of more effective and targeted treatments. Leveraging human genetic data empowers researchers with valuable insights into the genetic basis of diseases, facilitating the identification of better drug targets. The substantial presence of genetic evidence in FDA-approved drugs in 2021 (Ochoa et al., 2022) underscores its instrumental role in advancing drug discovery and fostering the emergence of innovative pharmaceutical solutions.


Linking Genetics to Clinical Data for Drug Discovery

To incorporate genetics into therapeutic development, researchers can link the genetic code of an individual to their Electronic Health Records (EHRs). Researchers can use techniques like Genome-wide association studies (GWASs), Phenome-wide association studies (PheWAS), Mendelian Randomization or Loss/Gain-of-Function Variants to discover associations between rare alleles and human disease (Krebs & Milani, 2023). Using these techniques, drugs tailored for Mendelian disorders have achieved notable success in clinical trials and approvals (Heilbron et al., 2021). For instance, the genetic disease Autosomal dominant hypercholesterolemia (ADH) confers an increased risk of coronary artery disease (CAD) through elevated levels of plasmatic low-density lipoprotein (LDL). By linking phenotypic data with genetic data, researchers were able to identify the association of the PCSK9 gene with high LDL levels (Abifadel et al., 2003). This kickstarted a series of studies that culminated in the approval of two monoclonal antibodies that inhibit PCSK9, Repatha (Evolocumab) and Praluent (Alirocumab) (Krebs & Milani, 2023; Robinson et al., 2015) with their treatment reducing the rate of major adverse cardiovascular events by half (Kaddoura et al., 2020). Indeed, therapies derived from these kinds of impactful rare alleles exhibit a 6-7.2 times greater likelihood of receiving approval due to their substantial effect on symptoms (Nelson et al., 2015; King et al., 2019). However, for many prevalent diseases, heritable risk is predominantly associated with numerous common variants, each having smaller individual effect sizes. This intricate genetic landscape complicates the identification of therapeutic targets, making the discovery of new avenues for therapy challenging and necessitating new strategies.


So far, a disproportionate number of published human genetic studies have centered on individuals of European descent (Fatumo et al., 2022). However, this narrow focus restricts our understanding to a limited diversity of alleles and genetic disorders, hindering the development of new therapies. To promote health equality, it’s crucial to establish study cohorts that include under‐represented populations. After all, individuals of European descent represent only a fraction of the total human genetic variation (Heilbron et al., 2021). Diverse cohorts represent unique opportunities for identifying novel drug targets based on genetic variants that are less frequent or even absent in people of European ancestry. Genetic discoveries will have greater discovery power in populations where a disease is more prevalent and, hence, with larger disease cohorts; at the same time, these discoveries will be more relevant and beneficial for these populations.


Founding the Alliance for Genomic Discovery

This need to identify rare genetic variants in diverse patient cohorts has driven the collaboration of NashBio and Illumina Inc. to establish AGD. AGD, comprising eight member organizations—AbbVie, Amgen, AstraZeneca, Bayer, Merck, Bristol Myers Squibb (BMS), GlaxoSmithKline Pharmaceuticals (GSK), and Novo Nordisk (Novo)—aims to expedite therapeutic development through whole-genome sequencing (WGS) of 250,000 samples from Vanderbilt University Medical Center’s (VUMC) biobank repository, BioVU®. In the first phase of AGD, deCODE genetics performed WGS on the first 35,000 VUMC samples, primarily made up of DNA from individuals of African ancestry. Moving forward, deCODE/Amgen will sequence the remaining samples, and the Alliance members will have access to the resulting data for drug discovery and therapeutic development. The WGS data will then be linked with structured EHR data from NashBio and VUMC, creating a valuable resource for pharmaceutical members to correlate genetic variations with clinical outcomes.



AGD marks a pivotal step in reshaping drug development, offering a solution to the challenges plaguing the pharmaceutical industry. With a staggering 90% failure rate in clinical trials, the incorporation of human genetic evidence into drug development by AGD aims to increase the approval likelihood of drug targets, fostering the discovery of more effective and targeted treatments. AGD also aims to address the limitations of existing genetic resources and studies. The WGS of 250,000 samples, encompassing diverse populations and linked with structured EHR data, provides pharmaceutical members with a powerful resource. This not only accelerates drug discovery but also facilitates the development of tailored therapies. AGD represents a significant step toward healthcare equality, highlighting the importance of diverse genetic studies in progressing drug discovery for the benefit of all people.



Abifadel, M., Varret, M., Rabès, J.-P., Allard, D., Ouguerram, K., Devillers, M., Cruaud, C., Benjannet, S., Wickham, L., Erlich, D., Derré, A., Villéger, L., Farnier, M., Beucler, I., Bruckert, E., Chambaz, J., Chanu, B., Lecerf, J.-M., Luc, G., … Boileau, C. (2003). Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nature Genetics, 34(2), 154–156.

Dowden, H., & Munro, J. (2019). Trends in clinical success rates and therapeutic focus. Nature Reviews. Drug Discovery, 18(7), 495–496.

Fatumo, S., Chikowore, T., Choudhury, A., Ayub, M., Martin, A. R., & Kuchenbaecker, K. (2022). A roadmap to increase diversity in genomic studies. Nature Medicine, 28(2), 243–250.

Harrison, R. K. (2016). Phase II and phase III failures: 2013-2015. Nature Reviews. Drug Discovery, 15(12), 817–818.

Heilbron, K., Mozaffari, S. V, Vacic, V., Yue, P., Wang, W., Shi, J., Jubb, A. M., Pitts, S. J., & Wang, X. (2021). Advancing drug discovery using the power of the human genome. The Journal of Pathology, 254(4), 418–429.

Hinkson, I. V., Madej, B., & Stahlberg, E. A. (2020). Accelerating Therapeutics for Opportunities in Medicine: A Paradigm Shift in Drug Discovery. Frontiers in Pharmacology, 11.

Kaddoura, R., Orabi, B., & Salam, A. M. (2020). PCSK9 Monoclonal Antibodies: An Overview. Heart Views : The Official Journal of the Gulf Heart Association, 21(2), 97–103.

King, E. A., Davis, J. W., & Degner, J. F. (2019). Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLOS Genetics, 15(12), e1008489.

Krebs, K., & Milani, L. (2023). Harnessing the Power of Electronic Health Records and Genomics for Drug Discovery. Annual Review of Pharmacology and Toxicology, 63(1), 65–76.

Nelson, M. R., Tipney, H., Painter, J. L., Shen, J., Nicoletti, P., Shen, Y., Floratos, A., Sham, P. C., Li, M. J., Wang, J., Cardon, L. R., Whittaker, J. C., & Sanseau, P. (2015). The support of human genetic evidence for approved drug indications. Nature Genetics, 47(8), 856–860.

Ochoa, D., Karim, M., Ghoussaini, M., Hulcoop, D. G., McDonagh, E. M., & Dunham, I. (2022). Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nature Reviews. Drug Discovery, 21(8), 551.

Robinson, J. G., Farnier, M., Krempf, M., Bergeron, J., Luc, G., Averna, M., Stroes, E. S., Langslet, G., Raal, F. J., El Shahawy, M., Koren, M. J., Lepor, N. E., Lorenzato, C., Pordy, R., Chaudhari, U., & Kastelein, J. J. P. (2015). Efficacy and Safety of Alirocumab in Reducing Lipids and Cardiovascular Events. New England Journal of Medicine, 372(16), 1489–1499.

Sun, D., Gao, W., Hu, H., & Zhou, S. (2022). Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica. B, 12(7), 3049–3062.


De-identification: Balancing Privacy and Utility in Healthcare Data

By Healthcare Data

Key Takeaways:

  • De-identification is the process of removing or obscuring personal health information in medical records to protect patient privacy.
  • De-identification is critical for enabling the sharing of data for secondary research purposes such as public health studies while meeting privacy regulations like HIPAA.
  • Common de-identification techniques include suppression, generalization, perturbation and synthetic data generation.
  • There is often a balance between data utility and privacy risk that must be evaluated on a case-by-case basis when de-identifying data.
  • Emerging privacy-enhancing computation methods like federated learning and differential privacy offer complementary approaches to de-identification.


What is De-identification and Why is it Important for Healthcare?

Patient health information is considered highly sensitive data in need of privacy protections. However, medical data sharing enables critically important research on public health, personalized medicine and more. De-identification techniques that remove identifying information and decrease the risk of exposing protected health information serve a crucial role in balancing these needs for privacy and innovation.


Definitions and Concepts

The HIPAA Privacy Rule defines de-identification as the process of preventing a person’s identity from being connected with health information. Once data has been de-identified per the Privacy Rule’s standards, it is no longer considered protected health information (PHI) and can be freely shared for research use cases like public health studies, therapeutic effectiveness studies and medical informatics analytics.


Perfect de-identification that carries no risk of re-identification of patients is very difficult, if not impossible, to accomplish with current technology. As a result, regulations like HIPAA allow for formal designations of “de-identified” health data based on achieving sufficient pseudonymity through the suppression or generalization of identifying data elements. HIPAA also defines a limited data set containing certain scrubbed identifiers that can be shared with a data use agreement rather than fully stripped identifiers.


The re-identification risk spectrum ranges from blatant identifiers like names, home addresses and social security numbers to quasi-identifiers like birthdates and narrowed locations that would not directly name the patient but could be pieced together to deduce identity in combination, especially as external data sources grow more public over time. State-of-the-art de-identification evaluates both blatant and quasi-identifier risks to minimize traceability while maximizing analytic utility.
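One common way to quantify quasi-identifier risk is k-anonymity: every combination of quasi-identifiers in a released dataset should be shared by at least k records, otherwise an outlier can be singled out. The sketch below, using invented records, computes the smallest group size as a simple risk check:

```python
from collections import Counter

# Hypothetical released records: no names, but quasi-identifiers remain
# (birth year, truncated ZIP code, sex). Values are invented for illustration.
records = [
    ("1984", "372**", "F"),
    ("1984", "372**", "F"),
    ("1951", "372**", "M"),   # unique combination -> re-identification risk
]

def min_group_size(rows):
    # k-anonymity check: the dataset is k-anonymous for k equal to the size
    # of its smallest quasi-identifier group.
    return min(Counter(rows).values())

k = min_group_size(records)
print(f"dataset is {k}-anonymous")
```

Here the 1951/M record forms a group of one, so the release is only 1-anonymous and that individual could potentially be traced by joining against external data.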


Motivating Use Cases

Research and public health initiatives rely on the sharing of de-identified health data to drive progress on evidence and outcomes. The Cancer Moonshot’s data sharing efforts highlight the massive potential impact of medical databases, cohorts and real-world evidence generation on accelerating cures via de-identified data aggregation and analytics. The openFDA program demonstrates governmental encouragement of privacy-respecting access to regulatory datasets to inform digital health entrepreneurs. Patient matching in these fragmented healthcare datasets would be impossible using directly identifiable data. Apple’s ResearchKit and CareKit frameworks facilitate de-identified mobile health data sharing for app developers to build new participatory research applications.


Data marketplaces and trusted third parties are emerging to certify and exchange research-ready, consented data assets like clinico-genomic data underlying scientific publications and clinical trials. Startups and health systems manage data sharing agreements and audit logs around distributed sites leveraging de-identified data. Rich metadata combined with privacy-preserving record linkage techniques that avoid direct identifiers enables specific patient subgroup analytics without compromise.


Overall research efficiency improves when more participants openly share their health data. But none of this research progress would be possible if stringent de-identification practices were not implemented to earn patient trust in data sharing.


De-Identification Techniques and Standards

There are two high-level categories of common de-identification protocols in healthcare: 1) suppressing blatant identifiers, typically following frameworks like HIPAA, and 2) actively transforming the data itself through various forms of generalization, perturbation or synthetic data production.


Suppressing Identifiers

The HIPAA Privacy Rule designates 18 categories of Protected Health Information identifiers that must be removed to achieve de-identified status, including names, geographic details narrower than state level, all dates other than years, contact information, IDs and record numbers, vehicle and device identifiers, URLs, IP addresses, biometrics, etc.
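A toy redaction pass over free text illustrates the suppression idea. Production pipelines combine trained named entity recognition models with curated rules; the two regular expressions below cover only SSN-shaped numbers and full dates and are purely illustrative:

```python
import re

# Toy redaction pass. Real de-identification pipelines use trained NER models
# plus curated rule sets; these two patterns are illustrative only.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),       # SSN-shaped numbers
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),  # full dates
]

def scrub(text):
    # Replace each matched identifier with a category token, preserving
    # the note's structure for downstream analytics.
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

note = "Seen on 04/12/2023. SSN 123-45-6789 on file."
print(scrub(note))
```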


Messages, images and unstructured data require specialized redaction processes to scrub both blatant and quasi-identifiers related to the patient, provider, institution or researchers involved. Named entity recognition and text annotation techniques help automate the detection of identifiable concepts. Voice data and video are more challenging mediums to de-identify.


Generalization and Aggregation

When full dates, locations, ages over 89 and other quasi-identifiers cannot be completely suppressed without losing analytic value from the structured data, generalization techniques help band these details into abstract categories to preserve some descriptive statistics while hiding individual values.


Aggregating masked data across many patient records also prevents isolation of individuals. Row level de-identification risks in sparse data featuring outliers and uncommon combinations of traits can be mitigated by pooling data points into broader summaries before release rather than allowing raw access.
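As a minimal sketch of generalization, the function below bands exact ages into decade-wide categories, collapsing ages 90 and over into a single top band (a simplified reading of HIPAA's treatment of ages over 89; the band widths are an arbitrary illustration):

```python
# Generalization sketch: band exact values into coarser categories so that
# descriptive statistics survive while individual values stay hidden.
def generalize_age(age):
    # HIPAA treats ages over 89 as identifying; collapse them into one band.
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

print([generalize_age(a) for a in (34, 67, 93)])
```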



Perturbation

Perturbation encompasses a wide array of mathematical and statistical data alteration techniques that distort the original data values and distributions while maintaining the general trends and correlations that warrant analysis.


Value distortion methods include quantization to normalize numbers into ranges, squeezing and stretching value dispersion, rounding or truncating decimals, swapping similar records and discretizing continuous variables. Objects can be clustered into groups that are then analyzed in aggregate. Multiple perturbed versions of the dataset can be safely released to enable reproducible confirmation of discovered associations while avoiding leakage of the precise source data.
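One toy form of perturbation, additive noise plus rounding, can be sketched as follows; the noise scale and lab values are made up, and a real pipeline would calibrate the distortion against the analyses it must preserve.

```python
import random

# Add zero-centered uniform noise and round, so each individual value is
# distorted while aggregate statistics (e.g., the mean) stay approximately
# intact. Scale, seed, and values are illustrative.

def perturb(values, scale=2.0, seed=42):
    rng = random.Random(seed)
    return [round(v + rng.uniform(-scale, scale), 1) for v in values]

labs = [5.4, 6.1, 5.9, 7.2, 6.5]
noisy = perturb(labs)
print(noisy)
```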


Combinations of generalization and perturbation provide flexibility for particular data types and contexts. The strengths, weaknesses and tuning of parameters merit a technical deep dive. The key is calibrating perturbation to maximize analytic integrity while minimizing re-identification risk. Ongoing access rather than static publication also allows refinement of data treatment to meet evolving security assumptions and privacy regulations.


Synthetic Data

Synthetic datasets represent an emerging approach for modeling realistic artificial data distributions that resemble an actual patient group without containing the original records. Once the statistical shape of data properties is learned from the genuine dataset, simulated synthetic data can be sampled from generative models that emulate plausible features and relationships without allowing deduction of the real samples.


The underlying models must sufficiently capture multidimensional interactions and representation of minority groups within the patient population. Features such as ethnicity, outcomes, treatments and behaviors must be appropriately represented instead of using simplistic or biased summary statistics that ignore important correlations. Synthetic data techniques applying machine learning and differential privacy mechanisms to reconstruct distributions show significant promise for shared data sandbox environments. Cloud vendors like AWS, Google Cloud and Microsoft Azure now provide synthetic data services.
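The idea can be sketched with a deliberately simple generator that learns only per-column statistics from a toy cohort; production systems (GANs, Bayesian networks, differentially private generators) also model the cross-column correlations this sketch ignores.

```python
import random
import statistics

# Learn simple per-column statistics from a tiny, made-up "real" cohort,
# then sample artificial records. This is a teaching sketch only: it
# ignores correlations between columns, which real generators must capture.

real = [{"age": 34, "sex": "F"}, {"age": 47, "sex": "M"},
        {"age": 52, "sex": "F"}, {"age": 61, "sex": "F"}]

mu = statistics.mean(r["age"] for r in real)
sd = statistics.stdev(r["age"] for r in real)
sexes = [r["sex"] for r in real]

rng = random.Random(0)
synthetic = [{"age": round(rng.gauss(mu, sd)), "sex": rng.choice(sexes)}
             for _ in range(100)]
print(len(synthetic))
```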


Evaluating the Risk-Utility Tradeoff

Ideally, de-identified health data removes enough identifying risk to prevent adversaries from recognizing individuals while retaining enough fidelity to offer scientific utility for the intended analyses by qualified researchers. But optimizing both privacy protection and analytic value requires navigating technical and ethical nuances around plausible re-identification vulnerabilities and scenarios balanced against access restrictions on derivative insights in the public interest.


Quantitative statistical metrics like k-anonymity attempt to mathematically define anonymity sets containing at least k records with the same combination of quasi-identifiers, so no individual can be isolated. L-diversity metrics additionally require diverse sensitive values within each of these groups to limit the confidence of guessing a matching identity. T-closeness measures how far the distribution of sensitive attributes within each group diverges from the overall distribution. Quantifying information loss helps data curators shape treatment processes and the inclusion of synthetic records. Interpreting these model-based metrics requires understanding their assumptions and limitations with respect to adversary means and background knowledge.
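The k in k-anonymity is straightforward to compute once quasi-identifiers are chosen; the sketch below, over hypothetical already-generalized records, returns the smallest equivalence-class size. Records in classes below the target k would need further generalization or suppression before release.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size over the chosen quasi-identifier combination."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical generalized records: banded age and 3-digit ZIP prefix.
rows = [{"age_band": "30-39", "zip3": "021", "dx": "E11"},
        {"age_band": "30-39", "zip3": "021", "dx": "I10"},
        {"age_band": "40-49", "zip3": "021", "dx": "E11"}]

print(k_anonymity(rows, ["age_band", "zip3"]))  # 1: the 40-49 record is unique
```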


More meaningful measures account for qualitative harms of identity traceability for affected groups based on socioeconomic status, minority status, immigration factors, substance use history, abuse status, disability needs and other cultural contexts that influence vulnerability irrespective of mathematical protections. Trusted access policies should let requesters provide verifiable rationale when seeking less-generalized data from data stewards who can evaluate situational sensitivity factors.


Overall responsibility falls upon custodial institutions and data safe havens to conduct contextual integrity assessments ensuring fair data flows to legitimate purposes. This means formally evaluating both welfare impacts on individuals and excluded populations, as well as potential data misuses or manipulative harms at population scale, such as discriminatory profiling. Updated governance mechanisms must address modern re-identification realities and connective threats.


Future Directions

Traditional de-identification practices struggle with handling high-dimensional, heterogeneous patient profiles across accumulating data types, modalities, apps, sensors, workflows and research studies. While valuable on their own, these techniques may fail to fully protect individuals as the ubiquity of digital traces multiplies potential quasi-identifiers. Absolute anonymity also severely limits permissible models and computations.


Emerging areas like federated analytics and differential privacy relax the goal of total de-identification by keeping raw records secured on distributed data servers and only allowing mathematical summaries to be queried from a central service, so that statistical patterns can be discovered across many sites without exposing the actual inputs from any one site. Legally defined limited data sets similarly bridge consented data access with managed identity risks for pre-vetted analysts.


Differentially private computations introduce mathematically calibrated noise to guarantee that the presence, absence, or sensitive attributes of any one patient are masked within the aggregate output. This masking allows research insights to be uncovered without revealing individual contributions. Secure multiparty computation and homomorphic encryption also enable certain restricted computations, such as aggregates, means and distributions, to be executed on sensitive inputs while keeping the underlying data encrypted.
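A minimal sketch of this calibrated-noise idea is the Laplace mechanism applied to a counting query, where the sensitivity is 1 because adding or removing one patient changes the count by at most 1. The epsilon value and count below are illustrative.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-transform sampling of a Laplace(0, scale) draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Noisy count: smaller epsilon means more noise and stronger privacy."""
    sensitivity = 1.0  # one patient changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(7)
print(round(dp_count(128, epsilon=1.0, rng=rng), 1))
```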


Such cryptographic methods and privacy-enhancing technologies provide complementary assurances to traditional de-identification practices. But governance, interpretation and usability remain active areas of improvement to fulfill ethical promises in practice. Holistic data safe havens must align emerging privacy-preserving computation capabilities with rigorous curation, context-based de-identification protocols and trust-based oversight mechanisms that can demonstrably justify public interest usages while preventing tangible harms to individuals and communities whose sensitive data fuels research.




Understanding Key Health Data Types: Clinical Trials, Claims, EHRs

By Clinical Trials, EHR, Health Data Types

Key Takeaways:

  • Key healthcare data types include clinical trials, insurance claims, and electronic health records (EHRs), each with distinct purposes.
  • Clinical trial data directly captures efficacy and safety of interventions, but availability is limited until publication and may lack generalizability.
  • Insurance claims provide large-scale utilization patterns, outcomes metrics across diverse groups, and cost analysis, but lack clinical precision.
  • EHR data offers longitudinal individual patient history and care details in operational workflows but quality and standardization varies.
  • Combining evidence across clinical trials, claims data, and EHRs enables real-world monitoring of interventions to guide optimal decisions and policies.


In an era of big data and analytics-driven healthcare, evidence informing clinical and policy decisions draws from an expanding variety of data sources that capture different aspects of patient care and outcomes. Three vital sources of health data include structured databases tracking results of clinical trials, administrative insurance claims systems, and electronic health records (EHRs) compiled at hospitals and health systems. Each data type serves distinct purposes with inherent strengths and limitations.


This article explains the defining characteristics, appropriate use cases, and limitations of clinical trials, insurance claims data, and EHRs for healthcare and life science researchers, operators, and innovators. Combining complementary dimensions across data types enables robust real-world monitoring of healthcare interventions to guide optimal decisions and policies matched to specific populations.


Clinical Trials

The randomized controlled trial (RCT) serves as the gold standard for evaluating safety and efficacy of diagnostic tests, devices, biologics, and therapeutics prior to regulatory approvals. Clinical trials compare treatments in specific patient groups, following strict protocols and monitoring outcomes over a set study period. Data elements captured include administered treatments, predefined clinical outcomes, patient-reported symptoms, clinician assessments, precision diagnostics, genomic biomarkers, other quantifiable endpoints, and adverse events.


RCT datasets supply the most scientifically valid assessment of efficacy and toxicity for an intervention compared to alternatives like placebos or other drugs because influential variables are intentionally balanced across study arms using eligibility criteria and random assignment. This internal validity comes at the cost of reduced generalizability, making it challenging to translate benefits and risks accurately to heterogeneous real-world populations. Published trial findings often overstate effectiveness when applied more broadly. Additional data from pragmatic studies is needed to complement classical efficacy findings along the product lifecycle.


Supplemental data integration is required to expand evidence beyond the limited snapshots of clinical trial participants and into continuous monitoring of outcomes across wider populations who are prescribed the treatments clinically. Here the high-level perspectives of insurance claims data and granular clinical details contained in EHRs play a vital role.


Insurance Claims

Administrative claims systems maintained by public and commercial health insurers serve payment and reimbursement purposes rather than research goals. Yet analysis of population-level claims data (coded diagnoses, procedures performed, medications dispensed, specialty types, facilities visited, and costs billed and reimbursed) yields insights into usage trends, treatment patterns, acute events, and cost efficiency that complement clinical trials.


Claims provide researchers a broad window into diagnoses, prescribed interventions, and health outcomes, frequently spanning millions of covered lives across geographical regions absent from most trials. Claims data encompasses all covered care delivered rather than isolated interventions. Examining trends over longer timeframes across more diverse patients who differ from strict trial eligibility enables assessment of real-world utilization frequencies, comparative effectiveness versus alternatives, clinical guideline adherence, acute complication rates, mortality metrics, readmission trends, and direct plus indirect medical costs.


However, claims data lacks the precise clinical measures systematically captured in trials and EHR records. Billing codes often fail to specify clinical severity or capture quality of life impacts. Available data elements focus primarily on how much and how often healthcare services are used rather than qualitative clinical details or patient-reported outcomes. Underlying diagnoses and accuracy of coding may require supplementary validation. Despite its limitations, claims data plays a crucial role in providing essential information for healthcare professionals, researchers, and policymakers. It serves as a valuable tool for monitoring diverse aspects of the healthcare system, ultimately contributing to the assurance of efficient, safe, and effective treatments.


Electronic Health Records

While abbreviated claims codes document utilization events at a population level and clinical trials quantify experience for circumscribed groups, the patient-centric Electronic Health Record (EHR) captures comprehensive individual-level clinical data accumulated over years of encounters across care settings. The longitudinal EHR chronicles detailed diagnoses, signs and symptoms, lab orders and results, exam findings, procedures conducted, prescriptions written, physician notes and orders, referral details, communications around critical results, and other discrete or unstructured elements reflecting patient complexity often excluded from claims data and trials.



EHRs provide fine-grained data for precision medicine inquiries into subsets of patients with common clinical trajectories, risk profiles, comorbidities, socioeconomic factors, access challenges, genomic risks, family histories of related illnesses, lifestyle behaviors like smoking, and personalized interventions based on advanced molecular markers. EHR data supports deep phenotyping algorithms and temporal pattern analyses that can extract cohort comparisons not feasible solely from claims.


Secondary use of EHR data faces several challenges: limited representativeness when data comes from a single health system rather than a national network; variability in coding terminologies and data-entry fields across platforms; fragmentation that forces linkage across separate specialties and sites of care; semi-structured formats mixing discrete codified variables with free text; and data quality gaps arising from clinician workflow constraints. Population-based claims data, by contrast, captures patients seeking care across all available providers rather than just one health system.


Integrating Complementary Evidence

Definitive clinical trial efficacy remains the gold standard when initially evaluating medical interventions, while large-scale claims data offers a complementary view of broader utilization patterns and comparative outcomes across more diverse populations who are receiving interventions in clinical practice. However, as interventions diffuse beyond the research setting, reliable acquisition of clinical details requires merging population-based signals from claims with deep clinical data contained uniquely within EHRs.


Combining evidence across clinical trials, claims databases, and EHR repositories maximizes strengths of each data type while overcoming inherent limitations of any single source. Clinical trials determine effectiveness, and combining insights from large-scale claims data with detailed clinical information in EHRs is crucial for assessing interventions as they transition from research to practical healthcare, contributing to overall healthcare improvement.


Aspect | Clinical Trial Data | Claims Data | EHR Data
Primary Purpose | Research and development of new treatments | Billing and reimbursement for services | Patient care and health record keeping
Data Source | Controlled clinical studies | Insurance companies, healthcare providers | Healthcare providers
Data Types Included | Patient demographics, treatment details, outcomes | Patient demographics, services rendered, cost | Patient demographics, medical history, diagnostics, treatment plans
Data Structure | Highly structured and standardized | Structured but varies with payer systems | Structured and unstructured (e.g., doctor's notes)
Temporal Span | Limited to the duration of the trial | Longitudinal, covering the duration of coverage | Longitudinal, covering comprehensive patient history
Access and Privacy | Restricted, subject to clinical trial protocols | Restricted, governed by the Health Insurance Portability and Accountability Act (HIPAA) | Restricted, governed by HIPAA and patient consent
Primary Users | Researchers, pharmaceutical companies | Healthcare providers, payers, policy makers | Healthcare providers, patients
Data Volume and Variety | Relatively limited, focused on specific conditions | Large, diverse, covering a wide range of conditions and services | Large, diverse, includes a wide range of medical information
Use in Healthcare | Drug development, understanding treatment effectiveness | Healthcare economics, policy making, fraud detection | Direct patient care, diagnosis, treatment planning
Challenges | Limited generalizability, high cost | Variability in coding, potential for missing data | Inconsistent data entry, variability in EHR systems




An Introduction to Polygenic Risk Scores: Aggregating Small Genetic Effects to Stratify Disease Risk

By Polygenic Risk Scores

Key Takeaways:

  • Polygenic risk scores aggregate the effects of thousands of genetic variants to estimate an individual’s inherited risk for complex diseases.
  • Polygenic risk is based on genome-wide association studies that identify common variants associated with modest increases in disease risk.
  • Polygenic scores provide risk stratification beyond family history, but most disease risk is not yet explained by known variants.
  • Clinical validity and utility of polygenic scores will improve as more disease-associated variants are discovered through large genomic studies.
  • Polygenic risk models may one day guide targeted screening and preventive interventions, but face challenges related to clinical interpretation and implementation.


Introduction to Polygenic Risk Scores

The vast majority of common, chronic diseases do not follow simple Mendelian inheritance patterns, but rather are complex genetic conditions arising from the combined small effects of thousands of genetic variations interacting with lifestyle and environmental factors. Polygenic risk scores aggregate information across an individual’s genome to estimate their inherited susceptibility for developing complex diseases like heart disease, cancer, diabetes and neuropsychiatric disorders.


Polygenic risk scores are constructed using data from genome-wide association studies (GWAS) that scan markers across the genomes of thousands to millions of individuals to identify genetic variants associated with specific disease outcomes. While most disease-associated variants have very small individual effects, the combined effect of thousands of these common, single nucleotide polymorphisms (SNPs) can stratify disease risk in a polygenic model.


Polygenic Scores vs. Single Gene Mutations

In monogenic diseases like cystic fibrosis and Huntington’s disease, a single genetic variant is necessary and sufficient to cause disease. Genetic testing for causal mutations in specific disease-linked genes provides a clear-cut diagnostic assessment. In contrast, no single gene variant accounts for more than a tiny fraction of risk for complex common diseases. Polygenic risk models aggregate the effects of disease-associated variants across the genome, each imparting a very modest increase or decrease in risk. An individual’s polygenic risk score reflects the cumulative impact of thousands of small risk effects spread across their genome.


While polygenic scores are probabilistic and estimate only inherited genetic susceptibility, monogenic mutations convey deterministic information about disease occurrence. However, for many individuals with elevated polygenic risk scores, modifiable lifestyle and environmental factors may outweigh their inherited predisposition, allowing prevention through early intervention.


GWAS and Polygenic Scores

Human genome-wide association studies utilize DNA microarray ‘chips’ containing hundreds of thousands to millions of SNPs across the genome. Comparing SNP frequencies between thousands of disease cases and controls reveals variants associated with disease diagnosis. Each SNP represents a common genetic variant, typically present in at least 1-5% of the population. Individually, SNP effects on disease risk are very modest, usually conferring less than a 20% increase in relative risk.


However, by aggregating the effects of disease-associated SNPs, polygenic risk models can categorize individuals along a spectrum of low to high inherited risk. Polygenic scores typically explain 7-12% of disease variance, though up to 25% for some cancers. The more powerful the original GWAS in terms of sample size, the better the polygenic score will be at predicting an individual’s predisposition.


Constructing Polygenic Scores

Various methods exist for constructing polygenic scores after identifying disease-associated SNPs through GWAS. Most commonly, a SNP effect size is multiplied by the number of risk alleles (0, 1 or 2) for that SNP in a given individual. These products are summed across all chosen SNPs to derive an overall polygenic risk score. SNPs strongly associated with disease receive more weight than weakly associated markers.
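The weighted-sum construction just described can be sketched in a few lines; the SNP IDs and effect sizes below are made-up numbers, not values from any real GWAS.

```python
# Hypothetical per-SNP effect sizes (e.g., log odds ratios from a GWAS).
effect_sizes = {"rs123": 0.12, "rs456": -0.05, "rs789": 0.30}

def polygenic_risk_score(genotype: dict) -> float:
    """Sum of (effect size x risk-allele count 0/1/2) over included SNPs."""
    return sum(beta * genotype.get(snp, 0) for snp, beta in effect_sizes.items())

# An individual carrying 1, 2, and 0 risk alleles at the three SNPs:
print(round(polygenic_risk_score({"rs123": 1, "rs456": 2, "rs789": 0}), 3))  # 0.02
```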


Rigorous validation in independent sample sets evaluates the predictive performance of polygenic scores. Optimal SNP inclusion thresholds are selected to maximize predictive ability. Polygenic models lose power with too few or too many SNPs included. Ideal thresholds retain SNPs explaining at least 0.01% of disease variance based on GWAS significance levels.


Applications and Limitations

Polygenic risk models are currently most advanced for coronary artery disease, breast and prostate cancer, type 2 diabetes and inflammatory bowel disease. Potential clinical applications include:


  • Risk stratification to guide evidence-based screening recommendations beyond family history.
  • Targeted prevention and lifestyle modification for individuals at elevated genetic risk.
  • Informing reproductive decision-making and genetic counseling based on polygenic risk.
  • Improving disease prediction, subtyping and prognosis when combined with clinical risk factors.


However, limitations and ethical concerns exist around polygenic score implementation:

  • Most heritability remains unexplained. Adding more SNPs only incrementally improves prediction.
  • Polygenic testing may prompt unnecessary interventions if clinical validity and utility are not adequately demonstrated.
  • Possible psychological harm and discrimination arising from probabilistic genetic risk information.
  • Unequal health benefits if not equitably implemented across populations.


While polygenic scores currently identify individuals with modestly increased or decreased disease risks, their predictive utility is anticipated to grow exponentially with million-person biobank efforts and whole-genome sequencing. Harnessing the full spectrum of genomic variation contributing to polygenic inheritance will enable more personalized risk assessment and clinical decision-making for complex chronic diseases.



  1. Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat Rev Genet. 2018 Sep;19(9):581-590.
  2. Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020 May 6;12(1):44.
  3. Khera AV, Chaffin M, Zekavat SM, et al. Whole-genome sequencing to characterize monogenic and polygenic contributions in patients hospitalized with COVID-19. Nat Commun. 2021 Jan 20;12(1):536.
  4. Torkamani A, Erion G, Wang J, et al. An evaluation of polygenic risk scores for predicting breast cancer. Breast Cancer Res Treat. 2019 Apr;175(2):493-503.
  5. Mars N, Koskela JT, Ripatti P, et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med. 2020 Nov;26(11):1660-1666.

Clinical Trials vs. Real-World Data: Understanding the Differences and Complementary Roles

By Clinical Trials

Key Takeaways:


  • Clinical trials are controlled experiments designed to evaluate safety and efficacy of new drugs or devices. Real-world data comes from more diverse, less controlled sources like electronic health records and medical claims.
  • Clinical trials have strict inclusion/exclusion criteria and measure predefined outcomes. Real-world data reflects broader populations with various comorbidities and outcomes.
  • Clinical trials are required for regulatory approval but have limitations like small sample sizes. Real-world evidence can complement trials with larger volumes of data over longer time periods.
  • Real-world data comes from routine clinical practice rather than protocol-driven trials. It provides supplementary information on effectiveness and safety.
  • Limitations of real-world data include lack of randomization, potential biases and confounders. Analytic methods help account for these limitations.
  • Real-world evidence has growing applications in medical product development, post-market surveillance, regulatory decisions and clinical guideline development.

Clinical Trials vs. Real-World Data

Clinical trials are prospective studies that systematically evaluate the safety and efficacy of investigational drugs, devices or treatment strategies in accordance with predefined protocols and statistical analysis plans. They are considered the gold standard for assessing the benefits and risks of medical interventions prior to regulatory approval. In clinical trials, participants are assigned to receive an investigational product or comparator/placebo according to a randomized scheme. These studies are designed to minimize bias and carefully control variables that may affect outcomes. Participants are closely monitored per protocol, and data are collected at prespecified time points. The resulting evidence from randomized controlled trials serves as the primary basis for regulatory decisions regarding drug and device approvals.


In contrast, real-world data (RWD) refers to data derived from various non-experimental or observational sources that reflect routine clinical practice. Sources of RWD include electronic health records (EHRs), medical claims, registry data and patient-generated data from mobile devices, surveys or wearables. Real-world evidence (RWE) is the clinical evidence generated from aggregation and analysis of RWD. While clinical trials evaluate medical products under ideal, controlled conditions in limited samples of patients, RWD offers information about usage, effectiveness and safety in broader patient populations in real-world settings.


Some key differences between clinical trials and real-world data:

  • Sample Populations – Clinical trials have strict inclusion and exclusion criteria, resulting in homogeneous samples that often underrepresent minorities, elderly, pediatric and complex patient groups. RWD reflects more diverse real-world populations with various comorbidities and concomitant medications.


  • Settings – Clinical trials are conducted at specialized research sites under tightly controlled conditions. RWD comes from routine care settings like hospitals, clinics and pharmacies across diverse geographies and populations.
  • Interventions – Clinical trials administer interventions per protocol. RWD reflects variabilities in real-world treatment patterns and patient adherence.
  • Outcomes – Clinical trials measure prespecified outcomes over limited timeframes. RWD captures broader outcomes like patient-reported outcomes, quality of life, hospitalizations and costs over longer periods in real-world practice.
  • Data Collection – Clinical trials collect data per protocol at predefined assessment points. RWD is collected during routine care and reflected in patient records and claims.
  • Sample Size – Clinical trials often have small sample sizes with a few hundred to several thousand patients. RWD encompasses data from tens or hundreds of thousands of patients.
  • Randomization – Clinical trials use randomization to minimize bias when assigning interventions. RWD studies are observational without the benefits of randomization.


While randomized controlled trials provide high quality evidence for drug/device approvals and clinical recommendations, RWD offers complementary information on effectiveness, safety, prescribing patterns and health outcomes:


  • RWD can provide broader demographic representation for subpopulations underrepresented in trials.
  • RWD can inform on long-term safety, durability of treatment effects and comparative effectiveness between therapies.
  • RWD can provide larger sample sizes to study rare events or outcomes.
  • RWD can reflect real-world utilization rates, switching patterns and adherence to therapies.
  • RWD offers granular data for personalized medicine, risk identification, prediction modeling and tailored interventions.
  • RWD is more timely, cost-effective and scalable than conducting large trials.


However, RWD has inherent limitations compared to clinical trials:


  • Lack of randomization increases potential for bias and confounding.
  • Incomplete data or misclassification errors are common with medical records.
  • Inability to firmly conclude causality due to observational nature.
  • Possible selection biases and variations in care delivery across settings.
  • Inconsistencies in definitions, coding, documentation practices over time and sites.


Analytical methods help account for these limitations when generating real-world evidence from RWD:


  • Advanced analytics like machine learning can identify trends and associations within large RWD.
  • Predictive modeling and simulations can estimate treatment effects.
  • Adjusting for confounders, stratification, patient matching and propensity scoring help reduce bias.
  • Expert review of data and methodology helps ensure reliability.
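As one concrete example of the matching techniques listed above, here is a sketch of greedy 1:1 propensity-score matching within a caliper. The scores are precomputed stand-ins for the output of a treatment-assignment model (e.g., logistic regression), and the caliper value is illustrative.

```python
def match_controls(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor match on propensity score within a caliper.

    treated/controls: lists of (patient_id, propensity_score) tuples.
    """
    pairs, available = [], list(controls)
    for pid, ps in treated:
        best = min(available, key=lambda c: abs(c[1] - ps), default=None)
        if best is not None and abs(best[1] - ps) <= caliper:
            pairs.append((pid, best[0]))
            available.remove(best)  # each control is used at most once
    return pairs

treated = [("t1", 0.62), ("t2", 0.35)]
controls = [("c1", 0.60), ("c2", 0.36), ("c3", 0.90)]
print(match_controls(treated, controls))  # [('t1', 'c1'), ('t2', 'c2')]
```

Outcomes are then compared only across the matched pairs, which mimics the covariate balance that randomization would have provided.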


Applications of RWE are expanding and gaining acceptance from key stakeholders:


  • Supplement clinical trial data for regulatory, coverage and payment decisions around medical products.
  • Post-market surveillance of drug and device safety and utilization in real-world practice.
  • Life cycle evidence generation for new indications, formulations, combination products.
  • Provide inputs into clinical guidelines by professional societies.
  • Risk identification/stratification, predictive modeling and personalized medicine.
  • Value-based contracting between manufacturers and payers.
  • Risk management and safety programs for hospitals and health systems.


In summary, clinical trials provide foundational evidence to introduce new medical products, while RWE offers complementary insights on effectiveness, safety, prescribing patterns and health outcomes at a larger scale across diverse real-world populations. Advanced analytics help derive meaningful RWE from RWD, with growing applications across the healthcare and life science ecosystems. Together, these sources of evidence offer a multifaceted understanding to guide optimal use of medical products and improve patient care.



  1. What are the different types of clinical research?
  2. Berger ML, et al. Real-World Evidence: What It Is and What It Can Tell Us According to the ISPOR Real-World Data Task Force. Value Health. 2021 Sep;24(9):1197-1204.
  3. Sherman RE, et al. Real-World Evidence – What Is It and What Can It Tell Us? N Engl J Med. 2016 Dec 8;375(23):2293-2297.
  4. Yu T, et al. Benefits, Limitations, and Misconceptions of Real-World Data Analyses to Evaluate Comparative Effectiveness and Safety of Medical Products. Clin Pharmacol Ther. 2019 Oct;106(4):765-778.
  5. Food and Drug Administration. Real-World Evidence.

The Role of Polygenic Risk Scores in Clinical Genomics

By Clinical Genomics


We were promised an end to genetic diseases. All we needed to do was unlock the human genome. Unfortunately, life has a way of being more complicated than we expect. Many genetic disorders turned out to result from the interplay of multiple genetic factors. This created the need for improved analytical tools that could interrogate the associations of many genetic backgrounds and link them to various diseases. One such technique, the Polygenic Risk Score (PRS), emerged as a powerful tool to quantify the cumulative effect of multiple genetic variants on an individual’s predisposition to a specific disease.

The Evolution of Polygenic Risk Scores

The genesis of PRS can be traced back to the early 2000s when researchers sought to comprehend the collective impact of multiple genetic variants on disease susceptibility. Initially viewed through a biological lens, the focus was on enhancing the prediction of diseases by analyzing subtle genomic variations. Studies concentrated on prevalent yet complex diseases such as diabetes, cardiovascular diseases, and cancer, laying the groundwork for a comprehensive understanding of their genetic architecture.


That changed when Dr. Sekar Kathiresan’s group showed that a PRS could be just as clinically informative as a single high-risk variant (Khera et al., 2018). Instead of reporting the percentage of people with a given PRS in each group (with or without a disease), they compared the risk of people with the highest and lowest scores, revealing a striking difference between the two extremes of the population.


In the initial stages, PRSs included only the most statistically significant variants from genome-wide association studies (GWAS). Geneticists often simply counted risk alleles, without weighting each variant by the size of its effect on disease risk. As the scores were refined, scientists challenged arbitrary significance cutoffs and advocated including all variants to maximize statistical power (on the assumption that variants with no true effect are, on average, equally likely to appear positively or negatively correlated with the trait). However, the proximity of variants on a chromosome presented another challenge. Variants that sit close together on a chromosome are less likely to be separated during recombination (a phenomenon called Linkage Disequilibrium), so a variant with no effect of its own can carry the signal of a nearby variant with a true effect, potentially leading to an overcounting of that signal.
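The weighted-sum construction described above can be sketched in a few lines of Python. Everything here is illustrative: the effect sizes and genotype are invented numbers, not real GWAS output.

```python
# Minimal sketch: a PRS as the weighted sum of risk-allele dosages.
# Effect sizes (betas) would come from GWAS summary statistics; the
# dosages (0, 1, or 2 copies of the risk allele) come from an
# individual's genotype. All numbers here are made up for illustration.

def polygenic_risk_score(dosages, betas):
    """Weighted PRS: sum of (allele dosage x GWAS effect size)."""
    return sum(d * b for d, b in zip(dosages, betas))

def unweighted_score(dosages):
    """The early approach: simply count risk alleles."""
    return sum(dosages)

genotype = [0, 1, 2, 1]              # risk-allele copies at four variants
effects = [0.02, 0.15, 0.01, 0.40]   # per-allele effects from a GWAS

print(unweighted_score(genotype))                # 4 risk alleles
print(polygenic_risk_score(genotype, effects))   # 0.15 + 0.02 + 0.40 = 0.57
```

The weighted version lets a single large-effect variant contribute more than several negligible ones, which is exactly what the unweighted allele count misses.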


To deal with this, geneticists used tools that remove signals within a specified block unless their correlation with the strongest signal falls below a threshold. One of the first packages, PRSice (Choi & O’Reilly, 2019), used an approach called Pruning and Thresholding. Scientists would choose a block size, say, 200,000 base pairs, and a program would slide that block along the genome. If a block contained more than a single signal, the program would remove (or “prune”) all but the strongest signal, keeping a variant only if its correlation with the strongest signal fell below the “threshold”. The drawback was that in regions containing several partially correlated variants that each truly affected disease risk, signal could be lost.
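The pruning step described above can be sketched as follows. This is a toy illustration of the idea only, not the PRSice implementation; the variant IDs, p-values, and r² values are invented.

```python
# Toy sketch of pruning within one window: keep the strongest signal,
# prune anything correlated with it above an r^2 threshold.

def prune_window(variants, r2, r2_threshold=0.1):
    """Keep the strongest signal in a window, plus any variant whose
    correlation (r^2) with it falls below the threshold."""
    strongest = min(variants, key=lambda v: v["p"])  # lowest p-value wins
    kept = [strongest]
    for v in variants:
        if v is strongest:
            continue
        if r2.get(frozenset((v["id"], strongest["id"])), 0.0) < r2_threshold:
            kept.append(v)  # weakly correlated: treat as an independent signal
    return kept

window = [
    {"id": "rs1", "p": 1e-8},
    {"id": "rs2", "p": 1e-5},
    {"id": "rs3", "p": 1e-3},
]
# Hypothetical pairwise r^2 values within the window:
r2 = {frozenset(("rs1", "rs2")): 0.8,   # rs2 mostly tags rs1 -> pruned
      frozenset(("rs1", "rs3")): 0.05}  # rs3 nearly independent -> kept

print([v["id"] for v in prune_window(window, r2)])  # ['rs1', 'rs3']
```

Note how rs2 is discarded even though it may carry some independent effect; this is the lost-signal problem the Bayesian methods discussed next were designed to address.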


Criticism from biostatisticians prompted a shift towards a Bayesian approach, reducing over-counting while better accounting for partially independent signals. Implementation was challenged by the extensive computational resources needed to update the signal at each genetic location based on the linkage disequilibrium of the surrounding SNPs. One program, PRS-CS (Ge et al., 2019), implemented a method that could apply changes to a whole linkage block at once, satisfying both geneticists’ demand for a method that runs on available computational tools and biostatisticians’ demand for accuracy and retained information.


Despite these advancements, accuracy challenges persisted, particularly when applying scoring systems across populations with different genetic ancestries. It turned out Linkage Disequilibrium was a pervasive problem. The patterns of Linkage Disequilibrium are different in people with different genetic ancestries. In fact, even statistics about the patterns themselves, like how big an average block size is, are different. Recognizing the need for improvement, ongoing efforts in refining PRSs aim to address these challenges, paving the way for more accurate and reliable applications. As researchers delve deeper into these complexities, the evolving landscape of PRSs continues to shape the future of clinical research.

Polygenic Risk Scores in Clinical Research Settings

To harness the full potential of PRS in clinical practice, a crucial shift is needed—from population-level insights to personalized predictions for individual patients. This transformation involves converting relative risks, which compare individuals across the PRS spectrum with a baseline group, into absolute risks for the specific disease (Lewis & Vassos, 2020). The current emphasis is on identifying individuals with a high genetic predisposition to disease, forming the foundation for effective risk stratification. This information guides decisions related to participation in screening programs, lifestyle modifications, or preventive treatments when deemed suitable.
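As a toy illustration of the relative-to-absolute conversion described above: if the population-average absolute lifetime risk of a disease is known, multiplying it by a stratum’s risk relative to that baseline yields the stratum’s absolute risk. All numbers below are invented.

```python
# Toy sketch: converting a relative risk (PRS stratum vs. population
# baseline) into an absolute lifetime risk. Numbers are illustrative.

def absolute_risk(baseline_risk, relative_risk):
    """Absolute risk for a PRS stratum, given its risk relative to baseline."""
    return baseline_risk * relative_risk

baseline = 0.10  # e.g. a hypothetical 10% average lifetime risk
for stratum, rr in [("bottom 20% PRS", 0.6), ("average", 1.0), ("top 20% PRS", 2.5)]:
    print(f"{stratum}: {absolute_risk(baseline, rr):.0%} lifetime risk")
```

An absolute figure like “25% lifetime risk” is what a patient and clinician can actually act on, whereas “2.5-fold higher than baseline” is only meaningful alongside the baseline itself.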


In practical applications, PRS demonstrates promise in patient populations with a high likelihood of disease. Consider a recent study in an East Asian population, where researchers developed a PRS for Coronary Artery Disease (CAD) using 540 genetic variants (Lu et al., 2022). Tested on 41,271 individuals, the top 20% had a three-fold higher risk of CAD compared to the bottom 20%, with lifetime risks of 15.9% and 5.8%, respectively. Adding PRS to clinical risk assessment slightly improved accuracy. Notably, individuals with intermediate clinical risk and high PRS reached risk levels similar to high clinical risk individuals with intermediate PRS, indicating the potential of PRS to refine risk assessment and identify those requiring targeted interventions for CAD.


Another application of PRS lies in improving screening for individuals with major disease risk alleles (Roberts et al., 2023). A recent breast cancer risk assessment study explored pathogenic variants in high- and moderate-risk genes (Gao et al., 2021). Over 95% of BRCA1, BRCA2, and PALB2 carriers had a lifetime breast cancer risk exceeding 20%. Conversely, integrating PRS placed over 30% of CHEK2 carriers and almost half of ATM carriers below the 20% threshold.


This trend extends to other diseases, such as prostate cancer, where a separate investigation focused on men with elevated levels of prostate-specific antigen (PSA) (Shi et al., 2023). Through the application of PRS, researchers pinpointed over 100 genetic variants linked to increased PSA levels. Ordinarily, such elevated PSA levels would prompt prostate biopsies to assess potential prostate cancer. By incorporating PRS into the screening process, doctors could have accounted for the natural, benign variation in PSA levels and prevented unnecessary escalation of clinical care. Together, these studies suggest that integrating PRS into health screening enhances accuracy, preventing unnecessary tests and enabling more personalized risk management.
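A toy sketch of the genetically adjusted screening idea: if part of a man’s measured PSA is explained by benign genetic variation, dividing by a personal genetic factor before comparing against the biopsy threshold can avoid unnecessary escalation. The factor, threshold, and numbers below are all invented for illustration; this is not the method of Shi et al.

```python
# Toy sketch of genetically adjusted PSA screening. If a man's variants
# predispose him to higher benign PSA, dividing the measured value by a
# personal genetic factor gives an adjusted value to compare against
# the biopsy threshold. All numbers are invented.

def adjusted_psa(measured_psa, genetic_factor):
    """PSA corrected for an individual's benign genetic predisposition."""
    return measured_psa / genetic_factor

BIOPSY_THRESHOLD = 4.0  # ng/mL, a commonly cited referral cutoff

measured = 4.6   # raw value: above threshold, would trigger a biopsy
factor = 1.3     # hypothetical: variants predict 30% higher benign PSA
adj = adjusted_psa(measured, factor)
print(round(adj, 2), adj > BIOPSY_THRESHOLD)  # 3.54 False -> biopsy avoided
```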


In the realm of pharmacogenetics, efforts to optimize treatment responses continue. While progress has been made in identifying rare high-risk variants linked to adverse drug events, predicting treatment effectiveness remains challenging. The evolving role of PRS in treatment response is particularly evident in statin use for reducing initial coronary events. In a real-world cohort without prior myocardial infarction, an investigation revealed that statin effectiveness varied based on CHD PRSs, with the highest impact in the high-risk group, intermediate in the intermediate-risk group, and the smallest effect in the low-risk group (Oni-Orisan et al., 2022). Post-hoc analyses like this for therapeutics could potentially allow for more targeted enrollment for clinical trial design, substantially reducing the number of participants needed to demonstrate trial efficacy (Fahed et al., 2022).


As the field of genetics continues to advance, PRSs emerge as a potent tool with the potential to aid clinical research. Validated PRSs show promise in enhancing the design and execution of clinical trials, refining disease screening, and developing personalized treatment strategies to improve the overall health and well-being of patients. However, it’s crucial to acknowledge that the majority of PRS studies rely heavily on datasets biased toward European ancestry. To refine and improve PRS, a comprehensive understanding of population genetic traits, such as linkage disequilibrium, for people of all backgrounds is essential. Moving forward, the integration of PRS into clinical applications must prioritize datasets with diverse ancestry to ensure equitable and effective utilization across all patient backgrounds. As research in this field progresses, the incorporation of PRS is poised to become an indispensable tool for expediting the development of safer and more efficacious therapeutics.



Choi, S. W., & O’Reilly, P. F. (2019). PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience, 8(7).


Fahed, A. C., Philippakis, A. A., & Khera, A. V. (2022). The potential of polygenic scores to improve cost and efficiency of clinical trials. Nature Communications, 13(1), 2922.


Gao, C., Polley, E. C., Hart, S. N., Huang, H., Hu, C., Gnanaolivu, R., Lilyquist, J., Boddicker, N. J., Na, J., Ambrosone, C. B., Auer, P. L., Bernstein, L., Burnside, E. S., Eliassen, A. H., Gaudet, M. M., Haiman, C., Hunter, D. J., Jacobs, E. J., John, E. M., … Kraft, P. (2021). Risk of Breast Cancer Among Carriers of Pathogenic Variants in Breast Cancer Predisposition Genes Varies by Polygenic Risk Score. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 39(23), 2564–2573.


Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.-C. A., & Smoller, J. W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications, 10(1), 1776.


Khera, A. V., Chaffin, M., Aragam, K. G., Haas, M. E., Roselli, C., Choi, S. H., Natarajan, P., Lander, E. S., Lubitz, S. A., Ellinor, P. T., & Kathiresan, S. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50(9), 1219–1224.


Lewis, C. M., & Vassos, E. (2020). Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1), 44.


Lu, X., Liu, Z., Cui, Q., Liu, F., Li, J., Niu, X., Shen, C., Hu, D., Huang, K., Chen, J., Xing, X., Zhao, Y., Lu, F., Liu, X., Cao, J., Chen, S., Ma, H., Yu, L., Wu, X., … Gu, D. (2022). A polygenic risk score improves risk stratification of coronary artery disease: a large-scale prospective Chinese cohort study. European Heart Journal, 43(18), 1702–1711.


Oni-Orisan, A., Haldar, T., Cayabyab, M. A. S., Ranatunga, D. K., Hoffmann, T. J., Iribarren, C., Krauss, R. M., & Risch, N. (2022). Polygenic Risk Score and Statin Relative Risk Reduction for Primary Prevention of Myocardial Infarction in a Real-World Population. Clinical Pharmacology and Therapeutics, 112(5), 1070–1078.


Roberts, E., Howell, S., & Evans, D. G. (2023). Polygenic risk scores and breast cancer risk prediction. Breast (Edinburgh, Scotland), 67, 71–77.


Shi, M., Shelley, J. P., Schaffer, K. R., Tosoian, J. J., Bagheri, M., Witte, J. S., Kachuri, L., & Mosley, J. D. (2023). Clinical consequences of a genetic predisposition toward higher benign prostate-specific antigen levels. EBioMedicine, 97, 104838.


Bridging the Gap: How AI Companies in the TechBio Space are Revolutionizing Biopharma using Genomics and Drug Response Data

By AI


Innovations in Artificial Intelligence (AI) have propelled pharmaceutical companies to revolutionize their approaches to designing, testing, and bringing precision medicine and healthcare solutions to the market. Two key elements in advancing precision medicine include early disease detection and understanding drug responders within distinct populations. By leveraging genomics and clinical notes, AI companies, specifically in the TechBio space, are transforming the way biopharma industries identify, understand, and cater to individuals rather than whole populations.

The Challenge: Precision Medicine and Drug Response

Traditional drug development methods often analyze the success of a drug treatment as its effect on a patient population, leading to highly variable outcomes and adverse effects among individual patients. This is despite the fact that, for many diseases, the underlying mechanisms driving symptoms can differ substantially from person to person. This lack of individualization can leave a treatment looking ineffective at the group level despite working well for certain individuals. If we hope to accelerate drug development and get cures into people’s hands faster, future research needs intelligent, cost-effective methods to stratify patients based on the contribution of different disease mechanisms and drug-processing capabilities. AI companies are helping biopharma address this challenge by systematically incorporating genomics and insights garnered from those individuals’ de-identified clinical charts.

Genomics: The Blueprint of Personalization

The genomic revolution has undoubtedly paved the way for precision medicine in biopharma. By analyzing an individual’s genetic data, scientists can identify variations that may influence drug metabolism and response. This approach has already proven highly effective, particularly in the case of breast cancer patients. In some instances of breast cancer, there is an overexpression of the HER2/neu protein (Gutierrez & Schiff, 2011). When genomic markers for this overexpression are identified, anti-HER2 antibodies can be incorporated into the treatment regimen, significantly enhancing survival rates. AI companies are at the forefront of continuing this research by utilizing genomics for the creation of genetic sub-groups essential for biomarker discovery and predicting the most effective drug treatments for individual patients (Quazi, 2022).

Disease Detection and Monitoring with AI-Enhanced Biomarker Research

Early detection and monitoring of disease progression are paramount for improving patient survival rates. Traditionally, biomarker research has focused on identifying individual molecules or transcripts that can serve as early indicators of future severe illness. However, the field is evolving beyond the notion of a single-molecule biomarker diagnostic. Instead, it is turning to AI to examine the relationships between molecules and transcripts, offering a more comprehensive approach to identifying the onset of significant diseases (Vazquez-Levin et al., 2023). Over the past decade, cancer research and clinical decision-making have undergone a significant transformation, shifting from qualitative data to a wealth of quantitative digital information.

Universities and clinical institutions globally have contributed a vast trove of biomarkers and imaging data. This extensive dataset encompasses insights from genomics, proteomics, metabolomics, and various omics disciplines, as well as inputs from oncology clinics, epidemiology, and medical imaging. AI, uniquely positioned to integrate this diverse information, holds the potential to spearhead the development of pioneering predictive models for drug responses, paving the way for groundbreaking advancements in disease diagnosis, treatment prediction, and overall decision-making concerning novel therapies. 

With growing collections of data, it is becoming easier to model how a drug will shift an individual’s biology for better or worse. A recent example of this modeling is the Cancer Patient Digital Twin (CPDT) project, in which multimodal temporal data collected from cancer patients are used to build a Digital Twin (a virtual replica of a patient’s biological processes and health status), allowing for in silico experimentation that may guide testing, treatment, or decision points (Stahlberg et al., 2022).

One example is improving the detection of metastatic disease over time from radiology reports. Using Natural Language Processing (NLP), researchers exposed prediction models to patients’ historical information (Batch et al., 2022). The authors extracted and encoded relevant features from medical text reports, and used these features to develop, train, and validate models. Over 700,000 radiology reports were used for model development to predict the presence of metastatic disease. Results from this study suggest that NLP models can extract cancer progression patterns from multiple consecutive reports and predict the presence of metastatic disease in multiple organs with higher performance than previous analytical techniques. Early knowledge of disease states or disease risk could lead to revised risk-benefit assessments for treatments and testing, potentially influencing patients’ choices. As a result, patients with otherwise comparable profiles may opt for treatments or tests they would not have otherwise considered. Even in cases where we do not have good biomarkers for a disease (for example, Alzheimer’s disease, where most biomarkers are quite invasive to collect), knowing earlier that a person has a higher disease risk can enable research that leads to better biomarkers and, ultimately, better treatments.
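As a purely illustrative sketch of the general idea (not the trained models of Batch et al.), here is a crude keyword-based feature extractor that turns free-text report language into features a downstream model could consume. The site list, patterns, and sample report are all invented.

```python
# Illustrative only: a crude keyword-based feature extractor for
# radiology report text. Real systems use trained NLP models over many
# consecutive reports; this just shows the step of converting free
# text into structured features.
import re

SITES = ["liver", "lung", "bone", "brain"]  # hypothetical organ list

def extract_features(report: str) -> dict:
    text = report.lower()
    # Flag each organ mentioned anywhere in the report.
    feats = {f"mentions_{s}": bool(re.search(rf"\b{s}\b", text)) for s in SITES}
    # Flag any mention of metastasis/metastatic/metastases.
    feats["mentions_metastasis"] = bool(re.search(r"\bmetast\w*", text))
    # Crude negation check, e.g. "no evidence of metastatic disease".
    feats["negated"] = bool(re.search(r"\bno evidence of metast\w*", text))
    return feats

report = "Interval growth of liver lesions, suspicious for metastatic disease."
print(extract_features(report))
```

Real clinical NLP adds tokenization, negation scoping, temporal linking across consecutive reports, and a trained classifier on top; the point here is only the text-to-features step.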

AI-Driven Pharmacogenomics: Revolutionizing Precision Medicine and Clinical Trials

While traditional approaches have paved the way for tailored medical treatments, the integration of AI can supercharge these efforts by leveraging an individual’s genetic information. Consider the case of Warfarin, a widely prescribed anticoagulant. Accurate dosing is critical at the start of Warfarin treatment, a period that carries higher risks of bleeding and clotting complications. Over decades, dose-response models have been developed to better understand how the drug affects the human body (Holford, 1986). To improve Warfarin anticoagulation therapy, dosing algorithms have incorporated genetic information to account for factors behind clotting issues, such as Warfarin clearance rate, improving dosage selection and therapy (Gong et al., 2011).

Now, with the power of AI, researchers can expedite the personalization of treatments for various disorders and medications, similar to what was accomplished with Warfarin but in a fraction of the time. AI algorithms are starting to analyze an individual’s genetic profile to predict their specific responses to various medications. This approach enables healthcare providers to fine-tune treatment plans, taking into account an individual’s unique genetic makeup, thus optimizing the effectiveness of therapies and reducing the potential for adverse effects. The integration of AI not only enhances the precision of pharmacogenomics but also streamlines the process, ultimately leading to safer and more efficient medical care tailored to each patient’s genetic characteristics.
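To make the idea concrete, here is a purely hypothetical dosing model in the spirit of the genotype-informed algorithms described above. Its structure (a clinical baseline adjusted by genotype terms) echoes published Warfarin algorithms, but every coefficient, variable name, and genotype label below is invented for illustration and must not be used clinically.

```python
# HYPOTHETICAL pharmacogenomic dosing sketch. The shape mirrors
# genotype-informed Warfarin algorithms (clinical baseline adjusted by
# genetic terms), but all coefficients here are invented.

def predicted_weekly_dose_mg(age, weight_kg, cyp2c9_reduced_alleles, vkorc1_sensitive):
    dose = 35.0                           # invented population baseline
    dose -= 0.2 * (age - 50)              # older patients tend to need less
    dose += 0.1 * (weight_kg - 70)        # heavier patients tend to need more
    dose -= 7.0 * cyp2c9_reduced_alleles  # slower clearance -> lower dose
    if vkorc1_sensitive:
        dose -= 10.0                      # more sensitive drug target -> lower dose
    return max(dose, 5.0)                 # floor at a minimal dose

# A 65-year-old, 80 kg patient with one reduced-function CYP2C9 allele
# and a sensitive VKORC1 genotype:
print(predicted_weekly_dose_mg(65, 80, 1, True))  # about 16 mg/week here
```

A genotype-blind protocol would start both this patient and a genetically typical one at the same dose and titrate by trial and error; a model of this shape starts each patient closer to their own target.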

The ultimate aspiration is to develop a sophisticated AI-driven system that can accurately forecast how each individual will react to specific medications, with the potential to bypass the conventional, time-consuming method of starting with the lowest effective dose and incrementally adjusting it upwards. This trial-and-error approach often leads to prolonged periods of uncertainty and potential adverse side effects for patients. Such advancements not only boost the precision of healthcare but also elevate the overall quality of life for patients seeking rapid relief and improved well-being.

Moreover, the integration of AI in pharmacogenomics has the potential to significantly expedite clinical trial programs. By tailoring medication doses to specific genetic backgrounds, AI aids at all three phases of the clinical trial process. This approach not only streamlines the trials but also offers substantial time and cost savings. The ability to tailor treatments for different genetic subgroups ensures that clinical trials are more efficient, bringing new therapies to market faster and ultimately benefiting patients in need.


The union of genomics and clinical notes, facilitated by AI, is ushering in a new era of precision medicine in biopharma. With the ability to predict individual drug responses and identify targeted therapies, this approach holds immense promise for improved treatment outcomes and a patient-centric view of medicine. As AI companies continue to advance their capabilities, the future of precision medicine for many diseases is looking closer than ever. The key to unlocking its full potential lies in the availability of high-quality data that comprehensively spans the entire patient journey. The integration of such diverse health-related data is central to driving valuable insights for drug development, making AI a driving force in the future of healthcare.



Batch, K. E., Yue, J., Darcovich, A., Lupton, K., Liu, C. C., Woodlock, D. P., El Amine, M. A. K., Causa-Andrieu, P. I., Gazit, L., Nguyen, G. H., Zulkernine, F., Do, R. K. G., & Simpson, A. L. (2022). Developing a Cancer Digital Twin: Supervised Metastases Detection From Consecutive Structured Radiology Reports. Frontiers in Artificial Intelligence, 5.

Gong, I. Y., Schwarz, U. I., Crown, N., Dresser, G. K., Lazo-Langner, A., Zou, G., Roden, D. M., Stein, C. M., Rodger, M., Wells, P. S., Kim, R. B., & Tirona, R. G. (2011). Clinical and genetic determinants of warfarin pharmacokinetics and pharmacodynamics during treatment initiation. PloS One, 6(11), e27808.

Gutierrez, C., & Schiff, R. (2011). HER2: biology, detection, and clinical implications. Archives of Pathology & Laboratory Medicine, 135(1), 55–62.

Holford, N. H. (1986). Clinical pharmacokinetics and pharmacodynamics of warfarin. Understanding the dose-effect relationship. Clinical Pharmacokinetics, 11(6), 483–504.

Quazi, S. (2022). Artificial intelligence and machine learning in precision and genomic medicine. Medical Oncology (Northwood, London, England), 39(8), 120.

Stahlberg, E. A., Abdel-Rahman, M., Aguilar, B., Asadpoure, A., Beckman, R. A., Borkon, L. L., Bryan, J. N., Cebulla, C. M., Chang, Y. H., Chatterjee, A., Deng, J., Dolatshahi, S., Gevaert, O., Greenspan, E. J., Hao, W., Hernandez-Boussard, T., Jackson, P. R., Kuijjer, M., Lee, A., … Zervantonakis, I. (2022). Exploring approaches for predictive cancer patient digital twins: Opportunities for collaboration and innovation. Frontiers in Digital Health, 4, 1007784.

Vazquez-Levin, M. H., Reventos, J., & Zaki, G. (2023). Editorial: Artificial intelligence: A step forward in biomarker discovery and integration towards improved cancer diagnosis and treatment. Frontiers in Oncology, 13, 1161118.