
Understanding Key Health Data Types: Clinical Trials, Claims, EHRs


Key Takeaways:

  • Key healthcare data types include clinical trials, insurance claims, and electronic health records (EHRs), each with distinct purposes.
  • Clinical trial data directly captures efficacy and safety of interventions, but availability is limited until publication and may lack generalizability.
  • Insurance claims provide large-scale utilization patterns, outcomes metrics across diverse groups, and cost analysis, but lack clinical precision.
  • EHR data offers longitudinal individual patient history and care details captured in operational workflows, but quality and standardization vary.
  • Combining evidence across clinical trials, claims data, and EHRs enables real-world monitoring of interventions to guide optimal decisions and policies.

In an era of big data and analytics-driven healthcare, evidence informing clinical and policy decisions draws from an expanding variety of data sources that capture different aspects of patient care and outcomes. Three vital sources of health data include structured databases tracking results of clinical trials, administrative insurance claims systems, and electronic health records (EHRs) compiled at hospitals and health systems. Each data type serves distinct purposes with inherent strengths and limitations.

This article explains the defining characteristics, appropriate use cases, and limitations of clinical trials, insurance claims data, and EHRs for healthcare and life science researchers, operators, and innovators. Combining complementary dimensions across data types enables robust real-world monitoring of healthcare interventions to guide optimal decisions and policies matched to specific populations.

Clinical Trials

The randomized controlled trial (RCT) serves as the gold standard for evaluating safety and efficacy of diagnostic tests, devices, biologics, and therapeutics prior to regulatory approvals. Clinical trials compare treatments in specific patient groups, following strict protocols and monitoring outcomes over a set study period. Data elements captured include administered treatments, predefined clinical outcomes, patient-reported symptoms, clinician assessments, precision diagnostics, genomic biomarkers, other quantifiable endpoints, and adverse events.

RCT datasets supply the most scientifically valid assessment of efficacy and toxicity for an intervention compared to alternatives like placebos or other drugs because influential variables are intentionally balanced across study arms using eligibility criteria and random assignment. This internal validity comes at a cost of potentially reduced generalizability and applicability. As a result, the benefits and risks observed in a trial are difficult to translate accurately to heterogeneous real-world populations. Published trial findings often overstate effectiveness when applied more broadly. Additional data from pragmatic studies is needed to complement classical efficacy findings along the product lifecycle.

Supplemental data integration is required to expand evidence beyond the limited snapshots of clinical trial participants and into continuous monitoring of outcomes across wider populations who are prescribed the treatments clinically. Here the high-level perspectives of insurance claims data and granular clinical details contained in EHRs play a vital role.

Insurance Claims

Administrative claims systems maintained by public and commercial health insurers serve payment and reimbursement purposes rather than research goals. Yet analysis of population-level claims data, which contains coded diagnoses, procedures performed, medications dispensed, specialty types, facilities visited, and costs billed and reimbursed, yields insights into usage trends, treatment patterns, acute events, and cost efficiency that complement clinical trials.

Claims provide researchers a broad window into diagnoses, prescribed interventions, and health outcomes, frequently spanning millions of covered lives across geographical regions that are absent from most trials. Claims data encompasses all covered care delivered rather than isolated interventions. Examining trends over longer timeframes across more diverse patients who differ from strict trial eligibility enables assessment of real-world utilization frequencies, comparative effectiveness versus alternatives, clinical guideline adherence, acute complication rates, mortality metrics, readmission trends, and direct plus indirect medical costs.

However, claims data lacks the precise clinical measures systematically captured in trials and EHR records. Billing codes often fail to specify clinical severity or capture quality of life impacts. Available data elements focus primarily on how much and how often healthcare services are used rather than qualitative clinical details or patient-reported outcomes. Underlying diagnoses and accuracy of coding may require supplementary validation. Despite its limitations, claims data plays a crucial role in providing essential information for healthcare professionals, researchers, and policymakers. It serves as a valuable tool for monitoring diverse aspects of the healthcare system, ultimately contributing to the assurance of efficient, safe, and effective treatments.

Electronic Health Records

While abbreviated claims codes document utilization events at a population level and clinical trials quantify experience for circumscribed groups, the patient-centric electronic health record (EHR) details comprehensive individual-level clinical data as a cumulative record built up over years of clinical encounters across care settings. The longitudinal EHR chronicles detailed diagnoses, signs and symptoms, lab orders and results, exam findings, procedures conducted, prescriptions written, physician notes and orders, referral details, communications around critical results, and other discrete or unstructured elements reflecting patient complexity often excluded from claims data and trials.


EHRs provide fine-grained data for precision medicine inquiries into subsets of patients with common clinical trajectories, risk profiles, comorbidities, socioeconomic factors, access challenges, genomic risks, family histories of related illnesses, lifestyle behaviors like smoking, and personalized interventions based on advanced molecular markers. EHR data supports deep phenotyping algorithms and temporal pattern analyses that can extract cohort comparisons not feasible solely from claims.

Secondary use of EHR data faces several challenges: limited representativeness when data comes from a single health system rather than a national network, variability in coding terminologies and data entry fields across platforms, fragmentation that forces linkage between separate specialties and sites of care, semi-structured formats that mix discrete coded fields with free text, and data quality gaps that arise when clinicians document under workflow constraints. Population-based claims data, by contrast, includes patients seeking care across all available providers rather than just one health system.

Integrating Complementary Evidence

Definitive clinical trial efficacy remains the gold standard when initially evaluating medical interventions, while large-scale claims data offers a complementary view of broader utilization patterns and comparative outcomes across more diverse populations who are receiving interventions in clinical practice. However, as interventions diffuse beyond the research setting, reliable acquisition of clinical details requires merging population-based signals from claims with deep clinical data contained uniquely within EHRs.

Combining evidence across clinical trials, claims databases, and EHR repositories maximizes the strengths of each data type while overcoming the inherent limitations of any single source. Clinical trials establish efficacy, and combining insights from large-scale claims data with detailed clinical information in EHRs is crucial for assessing interventions as they move from research into routine practice.


| Aspect | Clinical Trial Data | Claims Data | EHR Data |
| --- | --- | --- | --- |
| Primary Purpose | Research and development of new treatments | Billing and reimbursement for services | Patient care and health record keeping |
| Data Source | Controlled clinical studies | Insurance companies, healthcare providers | Healthcare providers |
| Data Types Included | Patient demographics, treatment details, outcomes | Patient demographics, services rendered, cost | Patient demographics, medical history, diagnostics, treatment plans |
| Data Structure | Highly structured and standardized | Structured but varies with payer systems | Structured and unstructured (e.g., doctor’s notes) |
| Temporal Span | Limited to the duration of the trial | Longitudinal, covering the duration of coverage | Longitudinal, covering comprehensive patient history |
| Access and Privacy | Restricted, subject to clinical trial protocols | Restricted, governed by Health Insurance Portability and Accountability Act (HIPAA) regulations | Restricted, governed by HIPAA and patient consent |
| Primary Users | Researchers, pharmaceutical companies | Healthcare providers, payers, policy makers | Healthcare providers, patients |
| Data Volume and Variety | Relatively limited, focused on specific conditions | Large, diverse, covering a wide range of conditions and services | Large, diverse, includes a wide range of medical information |
| Use in Healthcare | Drug development, understanding treatment effectiveness | Healthcare economics, policy making, fraud detection | Direct patient care, diagnosis, treatment planning |
| Challenges | Limited generalizability, high cost | Variability in coding, potential for missing data | Inconsistent data entry, variability in EHR systems |




Claims Data vs EHRs: Distinct but United in Real-World Research


In healthcare research, real-world data from patients is invaluable for gaining insights into disease patterns, treatment effectiveness, and outcomes. Two major sources of real-world data are claims data and electronic health records (EHRs). Both data types have distinct advantages and limitations that impact their utility for different research applications. This article examines the key differences between claims data and EHR data and how researchers can leverage both data sources to answer critical healthcare questions.

What is Claims Data?

Health insurance claims data contains information submitted by healthcare providers to payers to receive reimbursement for services rendered to patients. Claims data includes patient demographics (such as age, gender, and location) and insurance details, along with diagnosis codes, procedure codes, prescription details, and cost and reimbursement information. Claims data provides a longitudinal view of a patient’s interactions across healthcare systems, as it captures data each time a claim is filed over months or years of care (Pivovarov et al., 2019).

Large claims clearinghouses aggregate data from millions of patients across different payers, providing massive real-world datasets. For example, IBM MarketScan databases contain claims information on over 240 million US patients collected from employers, health plans and government health programs (IBM Watson Health, 2022). Other major claims aggregators include Optum Clinformatics, Premier Healthcare Database and PharMetrics Plus.

Key Details Captured in Claims Data

  • Patient demographics – age, gender, location, insurance details
  • Diagnoses – ICD diagnosis codes
  • Procedures – CPT and HCPCS procedure codes
  • Medications – NDC codes, dose, number of prescription refills
  • Costs – total and itemized costs, amount paid by payer and patient responsibility
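The fields listed above can be pictured as a simple record. The following is a minimal sketch only: the class names, field names, and dollar amounts are hypothetical and do not reflect any payer's actual schema, and pharmacy claims (NDC codes, refills) are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ClaimLine:
    """One billed service on a claim (all field names are illustrative)."""
    procedure_code: str    # CPT or HCPCS code
    billed_amount: float   # amount the provider charged
    paid_by_payer: float   # amount the insurer reimbursed
    patient_owes: float    # patient responsibility (copay, deductible, ...)


@dataclass
class Claim:
    """A medical claim: who, when, why (diagnoses), and what was billed."""
    patient_id: str
    service_date: date
    diagnosis_codes: List[str]                       # ICD diagnosis codes
    lines: List[ClaimLine] = field(default_factory=list)

    def total_billed(self) -> float:
        return sum(line.billed_amount for line in self.lines)

    def total_paid_by_payer(self) -> float:
        return sum(line.paid_by_payer for line in self.lines)

    def total_patient_responsibility(self) -> float:
        return sum(line.patient_owes for line in self.lines)


# Made-up example: an office visit plus an HbA1c test for a patient
# coded with type 2 diabetes (ICD-10-CM E11.9). Amounts are invented.
claim = Claim(
    patient_id="P001",
    service_date=date(2023, 3, 14),
    diagnosis_codes=["E11.9"],
    lines=[
        ClaimLine("99213", billed_amount=150.0, paid_by_payer=90.0, patient_owes=20.0),
        ClaimLine("83036", billed_amount=40.0, paid_by_payer=25.0, patient_owes=5.0),
    ],
)

print(claim.total_billed(), claim.total_paid_by_payer())
```

Note what the record does not hold: there is no lab value, no vital sign, and no note text. Only the fact that a test was billed is visible, not its result.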

Claims data is extremely valuable for comparative effectiveness research, pharmacoepidemiology, health economics and outcomes research (Berger et al., 2017). The large sample sizes and longitudinal view make claims databases ideal for studying disease incidence, treatment patterns, medication adherence, healthcare costs and utilization across different patient demographics and therapeutic areas.

Limitations of Claims Data

While claims data offers unparalleled scale and longitudinal perspective, researchers must be aware of its limitations:

  • Diagnosis codes do not necessarily indicate a confirmed diagnosis; for example, a provider might submit a claim with a code for a condition that is only being considered during a diagnostic workup.
  • Diagnosis and procedure codes may be inaccurate or incomplete if providers submit improper codes. Important clinical details are missing.
  • Prescription records show that a medication was dispensed, not whether the patient actually took it as directed.
  • Available data elements are restricted to what is required for reimbursement. No additional clinical context is provided.
  • Inability to link family members or track individuals who change payers over time.
  • Variable data quality and completeness across different claims sources.
  • Biased sampling based on specific payer population characteristics. May not represent the general population.

Despite these limitations, claims data remains highly useful for epidemiologic studies, health economics research, population health analyses and other applications where large sample sizes are critical. Researchers should account for the nuances of claims data during study design and analysis.
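One way analysts commonly account for rule-out diagnosis codes is to require the same code on at least two separate dates before treating a patient as a likely case. The sketch below illustrates that idea under stated assumptions: the 30-day gap, the tuple layout, and the function name are choices made for this example, not a rule taken from the sources cited here.

```python
from collections import defaultdict
from datetime import date, timedelta


def likely_confirmed(claims, code_prefix, min_gap_days=30):
    """Return IDs of patients whose claims carry a matching diagnosis code
    on two dates at least `min_gap_days` apart.

    `claims` is an iterable of (patient_id, service_date, diagnosis_codes).
    """
    dates_by_patient = defaultdict(list)
    for patient_id, service_date, codes in claims:
        if any(code.startswith(code_prefix) for code in codes):
            dates_by_patient[patient_id].append(service_date)

    min_gap = timedelta(days=min_gap_days)
    confirmed = set()
    for patient_id, dates in dates_by_patient.items():
        # If the earliest and latest coded dates are far enough apart,
        # the code appeared on at least two sufficiently separated visits.
        if max(dates) - min(dates) >= min_gap:
            confirmed.add(patient_id)
    return confirmed


# Made-up claims: P001 is coded twice months apart, P002 only once
# (possibly during a workup), and P003 never carries a diabetes code.
claims = [
    ("P001", date(2023, 1, 10), ["E11.9"]),
    ("P001", date(2023, 4, 2), ["E11.9", "I10"]),
    ("P002", date(2023, 2, 5), ["E11.65"]),
    ("P003", date(2023, 3, 1), ["I10"]),
]

print(likely_confirmed(claims, "E11"))  # {'P001'}
```

A stricter or looser rule trades missed cases against false positives, so any such threshold would ideally be validated against chart review.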

What are Electronic Health Records?

Electronic health records (EHRs) are digital documentation of patient health information generated throughout clinical care. EHRs are maintained by healthcare organizations and contain various data elements documenting patient encounters, including (Hersh et al., 2013):

  • Demographics – age, gender, race, ethnicity, language
  • Medical history – conditions, diagnoses, allergies, immunizations, procedures
  • Medications – prescriptions, dosing instructions
  • Vital signs – blood pressure, heart rate, weight, height
  • Lab test results
  • Radiology images
  • Clinical notes – physician progress notes, discharge summaries

A key advantage of EHR data is its rich clinical context. While claims data only captures billing codes, EHRs include detailed narratives, quantitative measures, images and comprehensive documentation of each patient visit. This facilitates better understanding of disease presentation and progression, treatment rationale and response, and patient complexity.

EHR databases aggregate records across large healthcare networks to compile real-world data on millions of patients. For instance, Vanderbilt University Medical Center’s Synthetic Derivative database contains de-identified medical records for over 3.7 million subjects and their BioVU® database contains over 310,000 DNA samples linked to de-identified medical records for genomics research (Roden et al., 2008).


Benefits of EHR Data

EHR data enables researchers to (Cowie et al., 2017):

  • Obtain granular clinical details beyond billing codes
  • Review physician notes and narratives for patient context
  • Link lab results, pathology reports, radiology images for additional phenotyping
  • Study unstructured data through natural language processing
  • Identify patient cohorts based on complex inclusion/exclusion criteria
  • Examine longitudinal disease patterns and treatment journeys

EHR data yields insights unattainable through claims data alone. The rich clinical details enable researchers to understand nuances in patient populations, disease manifestation and therapy response.
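To make the idea of inclusion and exclusion criteria concrete, here is a minimal sketch of a cohort filter that combines diagnosis codes with a lab value, something billing codes alone cannot support. The record layout, the criteria, and the HbA1c threshold of 8.0% are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatientSummary:
    """A flattened view of one patient's EHR data (illustrative fields)."""
    patient_id: str
    age: int
    diagnoses: List[str]            # ICD-10 codes from the record
    latest_hba1c: Optional[float]   # percent; None if never measured


def in_cohort(p: PatientSummary) -> bool:
    """Adults with type 2 diabetes and elevated HbA1c, excluding CKD."""
    has_t2d = any(code.startswith("E11") for code in p.diagnoses)
    is_adult = p.age >= 18
    elevated = p.latest_hba1c is not None and p.latest_hba1c >= 8.0
    has_ckd = any(code.startswith("N18") for code in p.diagnoses)
    return has_t2d and is_adult and elevated and not has_ckd


# Made-up patients, each failing (or passing) for a different reason.
patients = [
    PatientSummary("P001", 54, ["E11.9", "I10"], 9.1),    # meets all criteria
    PatientSummary("P002", 61, ["E11.9", "N18.3"], 8.4),  # excluded: CKD
    PatientSummary("P003", 47, ["E11.9"], None),          # no HbA1c on file
    PatientSummary("P004", 16, ["E11.9"], 9.5),           # under 18
    PatientSummary("P005", 58, ["E11.65"], 7.2),          # HbA1c below threshold
]

cohort = [p for p in patients if in_cohort(p)]
print([p.patient_id for p in cohort])  # ['P001']
```

Treating a missing lab value as "does not qualify" is itself a design decision. A real study would need to state how missing data is handled.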

Challenges with EHR Data

While valued for its clinical context, EHR data also has some inherent limitations:

  • Incomplete or missing records if providers fail to properly document encounters
  • Incomplete records if a patient receives care at multiple, unlinked healthcare networks
  • Inconsistent use of structured fields vs free text notes across systems
  • Lack of national standards in data formats, terminologies and definitions
  • Biased datasets that reflect a specific health system’s patient population
  • Difficulty normalizing data across disparate EHR systems
  • Requires data science skills to analyze unstructured notes and documents
  • Requires clinical background to appropriately interpret unstructured notes and documents
  • More resource intensive for data extraction and processing compared to claims data

EHR data analysis requires specialized skills and infrastructure, especially to interpret unstructured data. Despite limitations, EHRs remain an invaluable data source on their own or as complements to other data sources like claims for comprehensive real-world evidence generation.
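As a toy illustration of why unstructured notes take extra work, the sketch below pulls a smoking status out of free text with a few keyword rules. This is deliberately far simpler than real clinical natural language processing: the patterns, labels, and example notes are invented for this example, and rules like these miss many phrasings.

```python
import re

# Hypothetical keyword rules; checked in this order.
NEGATED = re.compile(r"\b(denies|no|never|non)[\s-]*(smok\w*|tobacco)", re.IGNORECASE)
FORMER = re.compile(r"\b(former|ex|quit)[\s-]*(smok\w*|tobacco)", re.IGNORECASE)
MENTION = re.compile(r"\b(smok\w*|tobacco)", re.IGNORECASE)


def smoking_status(note: str) -> str:
    """Classify a note with simple keyword rules (not a validated method)."""
    if NEGATED.search(note):
        return "non-smoker"
    if FORMER.search(note):
        return "former smoker"
    if MENTION.search(note):
        return "possible current smoker"
    return "not documented"


# Made-up note snippets.
notes = [
    "Patient denies smoking or alcohol use.",
    "Former smoker, quit in 2015.",
    "Smokes 1 pack per day.",
    "Presents with cough and fever.",
]

for note in notes:
    print(smoking_status(note), "<-", note)
```

A phrase such as "no history of smoking" would slip past the negation rule and be labeled a possible current smoker, which is exactly the kind of error that makes clinical review of extracted data necessary.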

Integrating Claims and EHR Datasets

Given the complementary strengths of claims data and EHRs, there is significant value in integrating these datasets to conduct robust real-world studies. This can be accomplished by (Maro et al., 2019):

  • Linking claims and EHR data at the patient level via unique identifiers
  • Building cohorts based on diagnosis codes from claims data, then reviewing clinical data for each patient in the EHR
  • Using natural language processing on EHR notes to extract additional details not available in claims
  • Applying claims analysis algorithms to EHR data to identify lapses in care, adverse events, etc.
  • Incorporating prescription fills from claims with medication orders in EHRs to assess adherence
  • Using cost data from claims combined with clinical data for health economic studies
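Two of the steps above, patient-level linkage and comparing EHR medication orders with claims prescription fills, can be sketched together. This example assumes both sources already share a patient identifier and a normalized drug name. In practice, linkage and drug-code mapping are substantial tasks of their own. The record layout, the 30-day window, and the function name are hypothetical.

```python
from datetime import date, timedelta

# Made-up medication orders from an EHR.
ehr_orders = [
    {"patient_id": "P001", "drug": "metformin", "ordered": date(2023, 1, 10)},
    {"patient_id": "P002", "drug": "metformin", "ordered": date(2023, 1, 12)},
    {"patient_id": "P003", "drug": "lisinopril", "ordered": date(2023, 2, 1)},
]

# Made-up pharmacy fills from claims.
claims_fills = [
    {"patient_id": "P001", "drug": "metformin", "filled": date(2023, 1, 11)},
    {"patient_id": "P003", "drug": "lisinopril", "filled": date(2023, 4, 20)},
]


def unfilled_orders(orders, fills, window_days=30):
    """Return EHR orders with no matching claims fill within the window."""
    # Index fill dates by the linkage key (patient, drug).
    fills_by_key = {}
    for fill in fills:
        key = (fill["patient_id"], fill["drug"])
        fills_by_key.setdefault(key, []).append(fill["filled"])

    window = timedelta(days=window_days)
    missing = []
    for order in orders:
        key = (order["patient_id"], order["drug"])
        fill_dates = fills_by_key.get(key, [])
        start = order["ordered"]
        if not any(start <= d <= start + window for d in fill_dates):
            missing.append(order)
    return missing


for order in unfilled_orders(ehr_orders, claims_fills):
    print(order["patient_id"], order["drug"])
```

Here P002 never filled the prescription and P003 filled it well outside the window, so both are flagged. Neither source alone shows this: the EHR records the order but not the fill, and claims record the fill but not the order.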

Major research networks like PCORnet have developed infrastructure to integrate claims and EHR data to support large-scale patient-centered outcomes research. When thoughtfully combined, these complementary data sources enable multifaceted real-world studies not possible using either source alone.

Claims data and EHRs both provide invaluable real-world evidence on patient populations, but have distinct strengths and limitations. Claims data allows longitudinal analysis of diagnosis, procedure and prescription patterns at scale, but lacks clinical granularity. EHRs provide rich clinical context like physician notes, lab results and images, but lack continuity across health systems and data standardization. By integrating these sources, researchers can conduct robust real-world studies leveraging the advantages of both datasets. Careful consideration of the nuances of each data type allows generation of comprehensive real-world evidence to inform healthcare decisions and improve patient outcomes.

At NashBio, we use EHR data for most of our analytic activities because of its depth and additional clinical context, which helps us build the highest fidelity study populations for our clients.


Berger, M. L., Sox, H., Willke, R. J., Brixner, D. L., Eichler, H.-G., Goettsch, W., … Schneeweiss, S. (2017). Good practices for real-world data studies of treatment and/or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making. Pharmacoepidemiology and Drug Safety, 26(9), 1033–1039.

Cowie, M. R., Blomster, J. I., Curtis, L. H., Duclaux, S., Ford, I., Fritz, F., … Zalewski, A. (2017). Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106(1), 1–9.

Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R. O., Bernstam, E. V., Lehmann, H. P., Hripcsak, G., Hartzog, T. H., Cimino, J. J., & Saltz, J. H. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30–S37.

IBM Watson Health. (2022). IBM MarketScan Databases.

Maro, J. C., Platt, R., Holmes, J. H., Stang, P. E., Steiner, J. F., & Douglas, M. P. (2019). Design of a study to evaluate the comparative effectiveness of analytical methods to identify patients with irritable bowel syndrome using administrative claims data linked to electronic medical records. Pharmacoepidemiology and Drug Safety, 28(2), 149–157.

Pivovarov, R., Albers, D. J., Hripcsak, G., Sepulveda, J. L., & Elhadad, N. (2019). Temporal trends of hemoglobin A1c testing. Journal of the American Medical Informatics Association, 26(1), 41–48.

Roden, D. M., Pulley, J. M., Basford, M. A., Bernard, G. R., Clayton, E. W., Balser, J. R., & Masys, D. R. (2008). Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clinical Pharmacology and Therapeutics, 84(3), 362–369.