From Doctors' Notes to New Therapies: The Promise of Unstructured Data

By Health Data Types, Healthcare Data

Key Takeaways:

  • Unstructured data from sources like clinical notes can provide valuable real-world insights to augment structured clinical trial data in drug development.
  • Natural language processing (NLP) enables mining of unstructured text data for information on drug efficacy, side effects, patient behaviors, and more.
  • Challenges include data privacy, integration across sources, and developing reliable NLP models to extract accurate insights.
  • Proper governance and cross-functional collaboration are needed to safely and effectively leverage unstructured data.
  • Responsible use of unstructured notes has the potential to accelerate drug development, improve safety monitoring, and support value-based care models.

The Untapped Potential of Unstructured Data

In the meticulous world of drug development, every data point is precious. Clinical trials generate a wealth of rigorously structured efficacy and safety data. During routine clinical care, computerized physician order entry systems and electronic health records capture structured data that can enhance drug efficacy and safety monitoring. However, an underutilized treasure trove of real-world information exists in the unstructured text of clinical notes, hospital records, and other loosely formatted sources gathered as part of standard medical practice.

Unleashing Insights with Natural Language Processing

Historically, this unstructured data has been difficult to integrate and analyze alongside its structured counterparts. This is partly due to variability in documentation practices among different healthcare providers. Additionally, the extraction of relevant data has traditionally relied on manual review and interpretation by clinically trained personnel. But major pharmaceutical companies are now investing heavily in natural language processing (NLP) to mine these unstructured sources for insights.

While NLP does not eliminate the need for human involvement, it can significantly streamline the process. NLP serves as a tool that works in conjunction with human interaction, combining the efficiency of intelligent automation with the ability to incorporate human feedback. This combination allows for more effective extraction of insights from unstructured data, which ultimately aids in accelerating research, optimizing clinical trials, and enhancing drug safety monitoring.

Applications: From Safety Signals to Patient Experiences

So when and how can unstructured data provide value? One key application is using NLP models trained on doctors’ notes to identify potential safety signals that may not surface until after a drug is approved and prescribed at scale. These real-world signals can prompt further investigations and narrow the “surrogate to reality” gap between clinical trials and clinical practice.

Unstructured data has also shown promise in two critical areas: better defining appropriate inclusion/exclusion criteria for clinical trials and identifying under-represented patient populations who may benefit from a treatment. By processing clinical records from diverse practices, researchers can find more suitable study cohorts for targeted recruitment efforts to ensure that clinical trials are more representative of real-world patient populations. Furthermore, analyzing unstructured data allows for a better understanding of real-world behaviors like treatment adherence or self-reporting of side effects.

Another valuable application of NLP is in tracking and codifying patients’ experiences based on anecdotal descriptions found in clinical notes. For example, phrases such as “The medicine made me feel queasy” can provide qualitative context around drug effects and quality-of-life. This context could support reporting requirements for post-marketing adverse events. Additionally, other qualitative context could complement clinical scoring tools used in the trial setting, potentially expediting label expansions for new indications.
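As a rough illustration of how an anecdotal phrase might be codified, the sketch below maps free-text expressions to symptom labels with simple pattern matching. The lexicon, the labels, and the function name `extract_symptoms` are assumptions made for this example; production NLP systems typically rely on trained models and controlled vocabularies, and this sketch does not handle negation ("did not feel queasy").

```python
import re

# Hypothetical phrase-to-label lexicon (illustrative only). A real system
# would map to a controlled vocabulary chosen by the project.
SYMPTOM_PATTERNS = {
    "nausea": re.compile(
        r"\b(queasy|nauseous|nauseated|sick to my stomach)\b", re.IGNORECASE
    ),
    "dizziness": re.compile(
        r"\b(dizzy|lightheaded|light-headed)\b", re.IGNORECASE
    ),
}


def extract_symptoms(note: str) -> list[str]:
    """Return the sorted symptom labels whose patterns appear in the note."""
    return sorted(
        label
        for label, pattern in SYMPTOM_PATTERNS.items()
        if pattern.search(note)
    )


print(extract_symptoms("The medicine made me feel queasy"))  # ['nausea']
```

Keeping the lexicon as data rather than logic makes it easier for clinically trained reviewers to inspect and correct, which fits the human-in-the-loop approach described earlier.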

Overcoming Obstacles to Implementation

Despite the opportunities, integrating unstructured data is not without challenges. Concerns around patient privacy and data security pose hurdles. While unstructured text provided for research purposes is typically de-identified, residual identifying information can remain. Utilizing data sources that have gone through multiple layers of de-identification efforts is crucial to mitigate this risk effectively. Further, reliably extracting structured insights from unstructured text across multiple source systems using NLP also presents difficulties.
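To make the idea of an additional de-identification layer concrete, here is a minimal sketch that scrubs a few kinds of residual identifiers from note text. The three patterns and the `scrub` helper are assumptions chosen for illustration; real de-identification covers many more identifier types and is validated against annotated data.

```python
import re

# Illustrative patterns only (assumed formats): US-style phone numbers,
# slash-separated dates, and "MRN" followed by digits.
RESIDUAL_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[MRN]"),
]


def scrub(note: str) -> str:
    """Replace each matched residual identifier with a placeholder tag."""
    for pattern, placeholder in RESIDUAL_PATTERNS:
        note = pattern.sub(placeholder, note)
    return note


print(scrub("Pt called from 555-123-4567 on 3/14/2023, MRN: 0012345."))
# Pt called from [PHONE] on [DATE], [MRN].
```

A pass like this would sit on top of, not replace, the upstream de-identification performed by the data source.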

Developing robust, production-grade NLP models requires immense training data, careful tuning to the healthcare/biomedical domain, and systematic quality testing. Merging unstructured insights with existing structured pipelines is also an intricate systems engineering challenge.

Data Governance and the Path Forward

Looking ahead, stakeholders agree that, if managed responsibly and embedded into robust data frameworks, unstructured real-world datasets can help high-quality healthcare reach its full potential. Pharmaceutical companies may find accelerated paths to drug approvals and label expansions. Payers could gain transparency to optimize formularies and pricing models. And ultimately, patients may benefit from better targeted treatments.



  • “Unlocking the Power of Unstructured Data in Drug Discovery.” DrugDiscoveryToday, 2021.
  • “Natural Language Processing in Drug Safety.” Pharmaceutical Medicine, 2021.
  • “Bridging the ‘Efficacy-to-Effectiveness’ Gap.” BioPharmaDive, 2022.
  • “NLP for Clinical Trial Eligibility Criteria.” NEJM Catalyst, 2021.
  • “NLP Applications in Life Sciences and Healthcare.” Optum, 2022.
  • “Challenges of Integrating Unstructured Data in Healthcare.” MIT, 2020.
  • “Data, RWE and the Future of Value-Based Care.” IQVIA, 2022.




The Basics of OMOP – Data Standardization in Healthcare

By Health Data Types, Healthcare Data

Key Takeaways:

  • OMOP Defined: The Observational Medical Outcomes Partnership (OMOP) is a common data model for organizing healthcare data from various sources.
  • Objective: OMOP aims to standardize and integrate diverse healthcare data, facilitating analysis and research.
  • Data Structuring: It organizes data into standard tables and fields (observations, procedures, drug exposures, conditions, etc.), enhancing analytics across datasets.
  • Enhanced Analysis: A common data model allows for larger data pooling, increasing statistical power in analysis.
  • Privacy Protection: OMOP prioritizes patient privacy, using de-identified data while retaining analytical utility.

What is OMOP Data?

OMOP represents a collaborative effort to standardize the transformation and analysis of healthcare data from diverse sources. Its goal is to optimize observational data for comparative research and analytics. The OMOP Common Data Model (CDM) prescribes a structured format for organizing heterogeneous healthcare data, encompassing demographics, encounters, procedures and more. This facilitates cross-platform analytics and queries. Notably, OMOP is a blueprint for data organization, not a database. It supports data standardization across platforms, leading to more robust datasets.

Key Features of OMOP:

  • Vocabulary Standards: For coding concepts like conditions and medications.
  • Standard Formats: For dates, codes and relational data structures.
  • Person-Centric Model: Data connected to individuals over time.
  • Support for Various Data Types: Like EHR, claims, registries, etc.
  • Open Source Licensing: Promotes free implementation and continuous evolution of the standard.

OMOP’s standardization ensures key clinical concepts are represented uniformly, balancing analytical utility with patient privacy.
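The features listed above can be sketched as a small transformation from a source diagnosis code into a row shaped like the OMOP `condition_occurrence` table. This is a simplified sketch, not a full ETL: only a handful of the table's fields are shown, the lookup dictionary stands in for the standardized vocabularies, and the concept ID in it is a made-up placeholder rather than a real OMOP concept. Using 0 for unmapped codes follows what is understood to be the usual OMOP convention.

```python
from dataclasses import dataclass
from datetime import date

# Placeholder lookup standing in for the OMOP standardized vocabularies.
# The concept ID below is invented for illustration only.
SOURCE_CODE_TO_CONCEPT = {("ICD10CM", "E11.9"): 900001}


@dataclass
class ConditionOccurrence:
    """A reduced subset of condition_occurrence fields."""
    person_id: int
    condition_concept_id: int
    condition_start_date: date
    condition_source_value: str


def to_condition_occurrence(
    person_id: int, vocabulary: str, code: str, start: date
) -> ConditionOccurrence:
    """Map a source code to a standard concept, keeping the source value."""
    concept_id = SOURCE_CODE_TO_CONCEPT.get((vocabulary, code), 0)  # 0 = unmapped
    return ConditionOccurrence(person_id, concept_id, start, code)


row = to_condition_occurrence(1, "ICD10CM", "E11.9", date(2023, 1, 5))
print(row.condition_concept_id, row.condition_source_value)  # 900001 E11.9
```

Retaining the original code alongside the standard concept lets analysts audit the mapping later, while the shared concept ID is what makes queries portable across datasets.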

Use of OMOP Data:

OMOP facilitates practical medical research by standardizing observational data, enabling:

  • Cross-platform analytics on combined datasets.
  • Reproduction of analyses and sharing of methods.
  • Application of predictive models across diverse data types.
  • Support for safety surveillance and pharmacovigilance.
  • Conducting population health studies and comparative effectiveness research.

Implemented by a variety of organizations, OMOP enables significant analytical use cases, including drug safety signal detection, real-world treatment outcome analysis and population health forecasting. By creating a common language for healthcare data, OMOP fosters data integration and analysis on a larger scale, accelerating health research.
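One way to see why analysis on a larger scale matters is to compare the uncertainty of an estimated event rate at a single site with that of a pooled dataset. The sketch below uses the simple binomial standard error; the site sizes and the 2% event rate are hypothetical, and the calculation ignores between-site heterogeneity, which a real pooled analysis would need to address.

```python
import math


def standard_error(event_rate: float, n: int) -> float:
    """Binomial standard error of an estimated proportion."""
    return math.sqrt(event_rate * (1 - event_rate) / n)


site_sizes = [10_000, 40_000, 50_000]  # hypothetical patient counts per site
rate = 0.02                            # hypothetical adverse event rate

single = standard_error(rate, site_sizes[0])
pooled = standard_error(rate, sum(site_sizes))

# Pooling 10x the patients shrinks the standard error by sqrt(10).
print(round(single / pooled, 2))  # 3.16
```

Because the standard error falls with the square root of the sample size, a tenfold increase in pooled patients narrows the uncertainty by a factor of roughly three, which is what allows rarer safety signals to be detected.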


Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model Specifications

Hripcsak, G., Duke, J.D., Shah, N.H., Reich, C.G., Huser, V., Schuemie, M.J. et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015

Overview of the OMOP Common Data Model –

Understanding Key Health Data Types: Clinical Trials, Claims, EHRs

By Clinical Trials, EHR, Health Data Types

Key Takeaways:

  • Key healthcare data types include clinical trials, insurance claims, and electronic health records (EHRs), each with distinct purposes.
  • Clinical trial data directly captures efficacy and safety of interventions, but availability is limited until publication and may lack generalizability.
  • Insurance claims provide large-scale utilization patterns, outcomes metrics across diverse groups, and cost analysis, but lack clinical precision.
  • EHR data offers longitudinal individual patient history and care details in operational workflows, but quality and standardization vary.
  • Combining evidence across clinical trials, claims data, and EHRs enables real-world monitoring of interventions to guide optimal decisions and policies.

In an era of big data and analytics-driven healthcare, evidence informing clinical and policy decisions draws from an expanding variety of data sources that capture different aspects of patient care and outcomes. Three vital sources of health data include structured databases tracking results of clinical trials, administrative insurance claims systems, and electronic health records (EHRs) compiled at hospitals and health systems. Each data type serves distinct purposes with inherent strengths and limitations.

This article explains the defining characteristics, appropriate use cases, and limitations of clinical trials, insurance claims data, and EHRs for healthcare and life science researchers, operators, and innovators. Combining complementary dimensions across data types enables robust real-world monitoring of healthcare interventions to guide optimal decisions and policies matched to specific populations.

Clinical Trials

The randomized controlled trial (RCT) serves as the gold standard for evaluating safety and efficacy of diagnostic tests, devices, biologics, and therapeutics prior to regulatory approvals. Clinical trials compare treatments in specific patient groups, following strict protocols and monitoring outcomes over a set study period. Data elements captured include administered treatments, predefined clinical outcomes, patient-reported symptoms, clinician assessments, precision diagnostics, genomic biomarkers, other quantifiable endpoints, and adverse events.

RCT datasets supply the most scientifically valid assessment of efficacy and toxicity for an intervention compared to alternatives like placebos or other drugs because influential variables are intentionally balanced across study arms using eligibility criteria and random assignment. This internal validity comes at the cost of potentially reduced generalizability and applicability. As a result, translating benefits and risks accurately to heterogeneous real-world populations is challenging. Published trial findings often overstate effectiveness when applied more broadly. Additional data from pragmatic studies is needed to complement classical efficacy findings along the product lifecycle.

Supplemental data integration is required to expand evidence beyond the limited snapshots of clinical trial participants and into continuous monitoring of outcomes across wider populations who are prescribed the treatments clinically. Here the high-level perspectives of insurance claims data and granular clinical details contained in EHRs play a vital role.

Insurance Claims

Administrative claims systems maintained by public and commercial health insurers serve payment and reimbursement purposes rather than research goals. Yet analysis of population-level claims data containing coded diagnoses, procedures performed, medications dispensed, specialty types, facilities visited, and costs billed and reimbursed reveals important usage trends, treatment patterns, acute events, and cost-efficiency insights that complement clinical trials.

Claims provide researchers a broad window into diagnoses, prescribed interventions, and health outcomes, frequently spanning millions of covered lives across geographical regions that are absent from most trials. Claims data encompasses all covered care delivered rather than isolated interventions. Examining trends over longer timeframes across more diverse patients who differ from strict trial eligibility enables assessment of real-world utilization frequencies, comparative effectiveness versus alternatives, clinical guideline adherence, acute complication rates, mortality metrics, readmission trends, and direct plus indirect medical costs.

However, claims data lacks the precise clinical measures systematically captured in trials and EHR records. Billing codes often fail to specify clinical severity or capture quality of life impacts. Available data elements focus primarily on how much and how often healthcare services are used rather than qualitative clinical details or patient-reported outcomes. Underlying diagnoses and accuracy of coding may require supplementary validation. Despite its limitations, claims data plays a crucial role in providing essential information for healthcare professionals, researchers, and policymakers. It serves as a valuable tool for monitoring diverse aspects of the healthcare system, ultimately contributing to the assurance of efficient, safe, and effective treatments.

Electronic Health Records

While abbreviated claims codes document utilization events at a population level and clinical trials quantify experience for circumscribed groups, the patient-centric electronic health record (EHR) captures comprehensive individual-level clinical data as a running record accumulated over years of clinical encounters across care settings. The longitudinal EHR chronicles detailed diagnoses, signs and symptoms, lab orders and results, exam findings, procedures conducted, prescriptions written, physician notes and orders, referral details, communications around critical results, and other discrete or unstructured elements reflecting patient complexity often excluded from claims data and trials.


EHRs provide fine-grained data for precision medicine inquiries into subsets of patients with common clinical trajectories, risk profiles, comorbidities, socioeconomic factors, access challenges, genomic risks, family histories of related illnesses, lifestyle behaviors like smoking, and personalized interventions based on advanced molecular markers. EHR data supports deep phenotyping algorithms and temporal pattern analyses that can extract cohort comparisons not feasible solely from claims.

Secondary use of EHR data faces several challenges: limited representativeness when data is drawn from a single health system rather than national networks; variability in coding terminologies and data entry fields across platforms; fragmentation that forces linkage between separate specialties and sites of care; semi-structured formats that mix discrete coded variables with free text; and data quality gaps arising from clinician workflow constraints. By contrast, population-based claims data includes patients seeking care across all available providers rather than just one health system.

Integrating Complementary Evidence

Definitive clinical trial efficacy remains the gold standard when initially evaluating medical interventions, while large-scale claims data offers a complementary view of broader utilization patterns and comparative outcomes across more diverse populations who are receiving interventions in clinical practice. However, as interventions diffuse beyond the research setting, reliable acquisition of clinical details requires merging population-based signals from claims with deep clinical data contained uniquely within EHRs.

Combining evidence across clinical trials, claims databases, and EHR repositories maximizes the strengths of each data type while overcoming the inherent limitations of any single source. Clinical trials establish efficacy, and combining insights from large-scale claims data with detailed clinical information in EHRs is crucial for assessing interventions as they transition from research to practical healthcare, contributing to overall healthcare improvement.
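A minimal sketch of what merging claims with EHR detail could look like is shown below. It assumes both sources have already been de-identified and share a common linkage key; the field names (`patient_key`, `drug_code`, `labs`) and the sample records are hypothetical and chosen only to illustrate the join.

```python
from collections import defaultdict

# Hypothetical, already de-identified records that share a linkage key.
claims = [
    {"patient_key": "p1", "drug_code": "drugA", "fill_date": "2023-01-10"},
    {"patient_key": "p2", "drug_code": "drugA", "fill_date": "2023-02-01"},
]
ehr_labs = [
    {"patient_key": "p1", "test": "HbA1c", "value": 7.9, "date": "2023-04-12"},
]


def link_by_patient(claims: list[dict], ehr_labs: list[dict]) -> list[dict]:
    """Attach each patient's EHR lab results to their claims records."""
    labs_by_patient = defaultdict(list)
    for lab in ehr_labs:
        labs_by_patient[lab["patient_key"]].append(lab)
    return [
        {**claim, "labs": labs_by_patient.get(claim["patient_key"], [])}
        for claim in claims
    ]


for record in link_by_patient(claims, ehr_labs):
    print(record["patient_key"], len(record["labs"]))
```

Patients who appear in claims but have no EHR data keep an empty `labs` list rather than being dropped, so the population-level view from claims is preserved while clinical depth is added where it exists.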


| Aspect | Clinical Trial Data | Claims Data | EHR Data |
| --- | --- | --- | --- |
| Primary Purpose | Research and development of new treatments | Billing and reimbursement for services | Patient care and health record keeping |
| Data Source | Controlled clinical studies | Insurance companies, healthcare providers | Healthcare providers |
| Data Types Included | Patient demographics, treatment details, outcomes | Patient demographics, services rendered, cost | Patient demographics, medical history, diagnostics, treatment plans |
| Data Structure | Highly structured and standardized | Structured but varies with payer systems | Structured and unstructured (e.g., doctor’s notes) |
| Temporal Span | Limited to the duration of the trial | Longitudinal, covering the duration of coverage | Longitudinal, covering comprehensive patient history |
| Access and Privacy | Restricted, subject to clinical trial protocols | Restricted, governed by Health Insurance Portability and Accountability Act (HIPAA) regulations | Restricted, governed by HIPAA and patient consent |
| Primary Users | Researchers, pharmaceutical companies | Healthcare providers, payers, policy makers | Healthcare providers, patients |
| Data Volume and Variety | Relatively limited, focused on specific conditions | Large, diverse, covering a wide range of conditions and services | Large, diverse, includes a wide range of medical information |
| Use in Healthcare | Drug development, understanding treatment effectiveness | Healthcare economics, policy making, fraud detection | Direct patient care, diagnosis, treatment planning |
| Challenges | Limited generalizability, high cost | Variability in coding, potential for missing data | Inconsistent data entry, variability in EHR systems |