
From Doctors' Notes to New Therapies: The Promise of Unstructured Data


Key Takeaways:

  • Unstructured data from sources like clinical notes can provide valuable real-world insights to augment structured clinical trial data in drug development.
  • Natural language processing (NLP) enables mining of unstructured text data for information on drug efficacy, side effects, patient behaviors, and more.
  • Challenges include data privacy, integration across sources, and developing reliable NLP models to extract accurate insights.
  • Proper governance and cross-functional collaboration are needed to safely and effectively leverage unstructured data.
  • Responsible use of unstructured notes has the potential to accelerate drug development, improve safety monitoring, and support value-based care models.

The Untapped Potential of Unstructured Data

In the meticulous world of drug development, every data point is precious. Clinical trials generate a wealth of rigorously structured efficacy and safety data. During routine clinical care, computerized physician order entry systems and electronic health records capture structured data that supports monitoring of drug efficacy and safety. However, an underutilized treasure trove of real-world information exists in the unstructured text of clinical notes, hospital records, and other loosely formatted sources gathered as part of standard medical practice.

Unleashing Insights with Natural Language Processing

Historically, this unstructured data has been difficult to integrate and analyze alongside its structured counterparts. This is partly due to variability in documentation practices among different healthcare providers. Additionally, the extraction of relevant data has traditionally relied on manual review and interpretation by clinically trained personnel. But major pharmaceutical companies are now investing heavily in natural language processing (NLP) to mine these unstructured sources for insights.

While NLP does not eliminate the need for human involvement, it can significantly streamline the process. NLP serves as a tool that works in conjunction with human interaction, combining the efficiency of intelligent automation with the ability to incorporate human feedback. This combination allows for more effective extraction of insights from unstructured data, which ultimately aids in accelerating research, optimizing clinical trials, and enhancing drug safety monitoring.
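As a minimal sketch of this kind of extraction, a rule-based pass can flag positive side-effect mentions in note text. The term list and negation cues below are illustrative stand-ins for what a real pipeline would get from a curated medical vocabulary (e.g., MedDRA) and a trained clinical NLP model:

```python
import re

# Illustrative side-effect terms; a production system would use a curated
# vocabulary plus a trained named-entity-recognition model.
SIDE_EFFECT_TERMS = ["nausea", "dizziness", "rash", "fatigue"]

def extract_side_effects(note: str) -> list:
    """Return side-effect terms positively mentioned in a clinical note,
    skipping mentions that are negated within the same sentence."""
    lowered = note.lower()
    found = []
    for term in SIDE_EFFECT_TERMS:
        if not re.search(rf"\b{term}\b", lowered):
            continue
        # Crude negation check: a cue word before the term, same sentence.
        negated = re.search(
            rf"\b(no|denies|without|negative for)\b[^.]*\b{term}\b", lowered
        )
        if not negated:
            found.append(term)
    return found

print(extract_side_effects("Patient reports nausea after dose. Denies dizziness."))
```

This is exactly the kind of step where human feedback matters: clinicians review borderline extractions, and the corrections feed back into the model or rule set.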

Applications: From Safety Signals to Patient Experiences

So when and how can unstructured data provide value? One key application is using NLP models trained on doctors’ notes to identify potential safety signals that may not surface until after a drug is approved and prescribed at scale. These real-world signals can prompt further investigations and narrow the “surrogate to reality” gap between clinical trials and clinical practice.

Unstructured data has also shown promise in two critical areas: better defining appropriate inclusion/exclusion criteria for clinical trials and identifying under-represented patient populations who may benefit from a treatment. By processing clinical records from diverse practices, researchers can find more suitable study cohorts for targeted recruitment efforts to ensure that clinical trials are more representative of real-world patient populations. Furthermore, analyzing unstructured data allows for a better understanding of real-world behaviors like treatment adherence or self-reporting of side effects.

Another valuable application of NLP is in tracking and codifying patients’ experiences based on anecdotal descriptions found in clinical notes. For example, phrases such as “The medicine made me feel queasy” can provide qualitative context around drug effects and quality-of-life. This context could support reporting requirements for post-marketing adverse events. Additionally, other qualitative context could complement clinical scoring tools used in the trial setting, potentially expediting label expansions for new indications.
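One way to picture this codification step is a phrase-to-concept lookup; the phrase table and concept labels below are invented for illustration and are not a real MedDRA mapping:

```python
# Hypothetical phrase table mapping colloquial patient language to
# standardized symptom concepts (labels are illustrative, not MedDRA codes).
PHRASE_TO_CONCEPT = {
    "queasy": "Nausea",
    "sick to my stomach": "Nausea",
    "wiped out": "Fatigue",
    "pins and needles": "Paraesthesia",
}

def codify_experience(quote: str) -> set:
    """Map a free-text patient quote to standardized symptom concepts."""
    lowered = quote.lower()
    return {concept for phrase, concept in PHRASE_TO_CONCEPT.items()
            if phrase in lowered}

print(codify_experience("The medicine made me feel queasy and wiped out."))
```

Once anecdotes are codified this way, they can be counted, trended, and compared against structured adverse-event reports.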

Overcoming Obstacles to Implementation

Despite the opportunities, integrating unstructured data is not without challenges. Concerns around patient privacy and data security pose hurdles. While unstructured text provided for research purposes is typically de-identified, residual identifying information can remain, so using data sources that have passed through multiple layers of de-identification is crucial to mitigate this risk. Reliably extracting structured insights from unstructured text across multiple source systems with NLP presents further difficulties.

Developing robust, production-grade NLP models requires immense training data, careful tuning to the healthcare/biomedical domain, and systematic quality testing. Merging unstructured insights with existing structured pipelines is also an intricate systems engineering challenge.

Data Governance and the Path Forward

Looking ahead, stakeholders agree that, managed responsibly and embedded in robust data frameworks, unstructured real-world datasets can help healthcare realize its full potential. Pharmaceutical companies may find accelerated paths to drug approvals and label expansions. Payers could gain transparency to optimize formularies and pricing models. And ultimately, patients may benefit from better-targeted treatments.



  • “Unlocking the Power of Unstructured Data in Drug Discovery.” DrugDiscoveryToday, 2021.
  • “Natural Language Processing in Drug Safety.” Pharmaceutical Medicine, 2021.
  • “Bridging the ‘Efficacy-to-Effectiveness’ Gap.” BioPharmaDive, 2022.
  • “NLP for Clinical Trial Eligibility Criteria.” NEJM Catalyst, 2021.
  • “NLP Applications in Life Sciences and Healthcare.” Optum, 2022.
  • “Challenges of Integrating Unstructured Data in Healthcare.” MIT, 2020.
  • “Data, RWE and the Future of Value-Based Care.” IQVIA, 2022.




Diversity in Data: Why It Matters for Drug Discovery


Key Takeaways:

  • The absence of diversity in clinical trial data can lead to biases and inequities in healthcare.
  • Regulators like the FDA are emphasizing the need for more diverse and representative data through initiatives like the Real World Evidence Program.
  • Having accurate, diverse real world data leads to more equitable and effective treatments by ensuring safety and efficacy across populations.
  • Pharmaceutical companies should prioritize capturing diverse real-world data and applying advanced analytics to identify variabilities in treatment response.

In recent years, a growing understanding has emerged regarding the critical need for diversity and representation in clinical research data. Historically, certain demographic groups such as women, minorities, and the elderly have been underrepresented in many clinical trials. This lack of diversity in the underlying data can lead to significant biases and inequities when new therapies are approved and launched.

For example, a landmark study in the early 1990s showed that women had been excluded from most major clinical trials, leading to gaps in knowledge about women’s responses to medications. The study found that eight out of ten prescription drugs withdrawn from the market posed greater health risks to women than men. This exemplifies the real dangers of not gathering data across diverse populations.

More recently, the COVID-19 pandemic has further revealed disparities in health outcomes and treatment responses between different demographic groups. Regulators have emphasized the need for clinical trials that are more representative of real-world diversity. In the United States, the Food and Drug Administration (FDA) now requires inclusion of underrepresented populations in clinical trials under the Improving Representation in Clinical Trials initiative.

The FDA has also created the Real World Evidence Program to evaluate the potential use of real-world data (RWD) from sources like electronic health records, insurance claims databases, and registries. The goal is to complement data from traditional trials with more diverse, real-world information on safety, effectiveness, and treatment response variabilities across patient subgroups.

Having access to accurate, representative real-world data enables more equitable and effective treatments in several key ways:

  1. Identifying safety issues or side effects that disproportionately impact certain populations based on factors like age, race, or comorbidities. This allows for better labeling and monitoring.
  2. Ensuring adequate efficacy across all segments of the patient population. Understanding variabilities in treatment response is key for optimal dosing guidance.
  3. Enabling development of targeted therapies for population subgroups where the risk-benefit profile may differ, such as pregnant women.
  4. Avoiding biases and inequities in access to treatment. Diverse data helps prevent therapies from being indicated for only limited populations.
  5. Informing appropriate use criteria and payor coverage decisions based on real-world comparative effectiveness across groups.

From a regulatory compliance perspective, lack of representation in trial data can also lead to delays or rejection of new drug and device applications. The FDA has advised that drugs may not be approvable if safety and efficacy has not been demonstrated across demographics.

Looking ahead, embracing diversity and representativeness throughout the drug discovery process will be critical. Pharmaceutical companies should make gathering inclusive, real-world data a priority. Advanced analytics techniques like machine learning can then help unlock insights about treatment response variabilities within diverse patient populations.

Ultimately, leveraging diverse and representative data will lead to more equitable, effective personalized healthcare and better outcomes for all patients.


  • Improving Representation in Clinical Trials and Research: FDA’s New Efforts to Bridge the Gap – FDA
  • Real-World Evidence – FDA
  • Racial and Ethnic Differences in Response to Medicines: Towards Individualized Pharmaceutical Treatment – NIH
  • Addressing sex, gender, and intersecting social identities across the translational science spectrum – NIH
  • Utilizing Real-World Data for Clinical Trials: The Role of Data Curators – NIH

The Basics of OMOP – Data Standardization in Healthcare


Key Takeaways:

  • OMOP Defined: The Observational Medical Outcomes Partnership (OMOP) is a common data model for organizing healthcare data from various sources.
  • Objective: OMOP aims to standardize and integrate diverse healthcare data, facilitating analysis and research.
  • Data Structuring: It organizes data into standard tables and fields (observations, procedures, drug exposures, conditions, etc.), enhancing analytics across datasets.
  • Enhanced Analysis: A common data model allows for larger data pooling, increasing statistical power in analysis.
  • Privacy Protection: OMOP prioritizes patient privacy, using de-identified data while retaining analytical utility.

What is OMOP Data?

OMOP represents a collaborative effort to standardize the transformation and analysis of healthcare data from diverse sources. Its goal is to optimize observational data for comparative research and analytics. The OMOP Common Data Model (CDM) prescribes a structured format for organizing heterogeneous healthcare data, encompassing demographics, encounters, procedures and more. This facilitates cross-platform analytics and queries. Notably, OMOP is a blueprint for data organization, not a database. It supports data standardization across platforms, leading to more robust datasets.

Key Features of OMOP:

  • Vocabulary Standards: For coding concepts like conditions and medications.
  • Standard Formats: For dates, codes and relational data structures.
  • Person-Centric Model: Data connected to individuals over time.
  • Support for Various Data Types: Like EHR, claims, registries, etc.
  • Open Source Licensing: Promotes free implementation and continuous evolution of the standard.

OMOP’s standardization ensures key clinical concepts are represented uniformly, balancing analytical utility with patient privacy.

Use of OMOP Data:

OMOP facilitates practical medical research by standardizing observational data, enabling:

  • Cross-platform analytics on combined datasets.
  • Reproduction of analyses and sharing of methods.
  • Application of predictive models across diverse data types.
  • Support for safety surveillance and pharmacovigilance.
  • Conducting population health studies and comparative effectiveness research.

Implemented by a variety of organizations, OMOP enables significant analytical use cases, including drug safety signal detection, real-world treatment outcome analysis and population health forecasting. By creating a common language for healthcare data, OMOP fosters data integration and analysis on a larger scale, accelerating health research.


  • Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model Specifications
  • Hripcsak, G., Duke, J.D., Shah, N.H., Reich, C.G., Huser, V., Schuemie, M.J. et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015
  • Overview of the OMOP Common Data Model


De-identification: Balancing Privacy and Utility in Healthcare Data


Key Takeaways:

  • De-identification is the process of removing or obscuring personal health information in medical records to protect patient privacy.
  • De-identification is critical for enabling the sharing of data for secondary research purposes such as public health studies while meeting privacy regulations like HIPAA.
  • Common de-identification techniques include suppression, generalization, perturbation and synthetic data generation.
  • There is often a balance between data utility and privacy risk that must be evaluated on a case-by-case basis when de-identifying data.
  • Emerging privacy-enhancing computation methods like federated learning and differential privacy offer complementary approaches to de-identification.

What is De-identification and Why is it Important for Healthcare?

Patient health information is considered highly sensitive data in need of privacy protections. However, medical data sharing enables critically important research on public health, personalized medicine and more. De-identification techniques that remove identifying information and decrease the risk of exposing protected health information serve a crucial role in balancing these needs for privacy and innovation.

Definitions and Concepts

The HIPAA Privacy Rule defines de-identification as the process of preventing a person’s identity from being connected with health information. Once data has been de-identified per the Privacy Rule’s standards, it is no longer considered protected health information (PHI) and can be freely shared for research use cases like public health studies, therapeutic effectiveness studies, and medical informatics analytics.

Perfect de-identification that carries no risk of re-identification of patients is very difficult, if not impossible, to achieve with current technology. As a result, regulations like HIPAA allow formal designations of “de-identified” health data based on achieving sufficient pseudonymity through the suppression or generalization of identifying data elements. HIPAA also defines a limited data set, containing certain scrubbed identifiers, that can be shared under a data use agreement rather than requiring full removal of identifiers.

The re-identification risk spectrum ranges from blatant identifiers like names, home addresses and social security numbers to quasi-identifiers like birthdates and narrowed locations that would not directly name the patient but could be pieced together to deduce identity in combination, especially as external data sources grow more public over time. State-of-the-art de-identification evaluates both blatant and quasi-identifier risks to minimize traceability while maximizing analytic utility.

Motivating Use Cases

Research and public health initiatives rely on the sharing of de-identified health data to drive progress on evidence and outcomes. The Cancer Moonshot’s data sharing efforts highlight the massive potential impact of medical databases, cohorts and real-world evidence generation on accelerating cures via de-identified data aggregation and analytics. The openFDA program demonstrates governmental encouragement of privacy-respecting access to regulatory datasets to inform digital health entrepreneurs. Patient matching in these fragmented healthcare datasets would be impossible using directly identifiable data. Apple’s ResearchKit and CareKit frameworks facilitate de-identified mobile health data sharing for app developers to build new participatory research applications.

Data marketplaces and trusted third parties are emerging to certify and exchange research-ready, consented data assets like clinico-genomic data underlying scientific publications and clinical trials. Startups and health systems manage data sharing agreements and audit logs around distributed sites leveraging de-identified data. Rich metadata combined with privacy-preserving record linkage techniques that avoid direct identifiers enables specific patient subgroup analytics without compromise.

Overall research efficiency improves when more participants openly share their health data. But none of this research progress would be possible if stringent de-identification practices were not implemented to earn patient trust in data sharing.

De-Identification Techniques and Standards

There are two high level categories of common de-identification protocols in healthcare: 1) suppressing blatant identifiers, typically following frameworks like HIPAA, and 2) actively transforming the data itself through various forms of generalization, perturbation or synthetic data production.

Suppressing Identifiers

The HIPAA Privacy Rule designates 18 categories of Protected Health Information identifiers that must be removed to achieve de-identified status, including names, geographic details narrower than state level, all dates other than years, contact information, IDs and record numbers, vehicle and device identifiers, URLs, IP addresses, and biometrics, among others.

Messages, images and unstructured data require specialized redaction processes to scrub both blatant and quasi-identifiers related to the patient, provider, institution or researchers involved. Named entity recognition and text annotation techniques help automate the detection of identifiable concepts. Voice data and video are more challenging mediums to de-identify.
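A minimal sketch of the pattern-based side of that redaction; the regexes below cover only a few of the 18 HIPAA identifier categories, and production systems layer trained NER models on top of patterns like these:

```python
import re

# Illustrative suppression patterns; real redaction pipelines combine many
# more patterns with named-entity recognition for names and locations.
PATTERNS = {
    "[DATE]":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace recognized identifiers with category placeholders."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(tag, text)
    return text

print(redact("Seen on 4/12/2023; call 555-867-5309."))
```

Note the ordering: the phone pattern runs before the SSN pattern so that a phone number is never misread as a partial SSN match.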

Generalization and Aggregation

When formal dates, locations, ages over 89 and other quasi-identifiers cannot be completely suppressed without losing analytic value from the structured data, generalization techniques help band these details into abstract categories to preserve some descriptive statistics while hiding individual values.

Aggregating masked data across many patient records also prevents isolation of individuals. Row level de-identification risks in sparse data featuring outliers and uncommon combinations of traits can be mitigated by pooling data points into broader summaries before release rather than allowing raw access.
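The banding idea can be sketched as a simple record transform. The band widths and ZIP truncation rule below are illustrative; real Safe Harbor ZIP handling also checks that the three-digit area is populous enough, which is omitted here:

```python
def generalize_record(record: dict) -> dict:
    """Band quasi-identifiers into coarser categories (illustrative rules)."""
    out = dict(record)
    age = out.pop("age")
    # HIPAA Safe Harbor treats ages over 89 as a single category.
    if age > 89:
        out["age_band"] = "90+"
    else:
        decade = (age // 10) * 10
        out["age_band"] = f"{decade}-{decade + 9}"
    # Keep only the first three ZIP digits (population check omitted).
    out["zip3"] = out.pop("zip")[:3] + "**"
    return out

print(generalize_record({"age": 34, "zip": "02139"}))
```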


Perturbation

Perturbation encompasses a wide array of mathematical and statistical data alteration techniques that aim to distort the original data values and distributions while maintaining the general trends and correlations warranting analysis.

Value distortion methods include quantization to normalize numbers into ranges, squeezing and stretching value dispersion, rounding or truncating decimals, swapping similar records and discretizing continuous variables. Objects can be clustered into groups that are then analyzed in aggregate. Multiple perturbed versions of the dataset can be safely released to enable reproducible confirmation of discovered associations while avoiding leakage of the precise source data.

Combinations of generalization and perturbation provide flexibility for particular data types and contexts. The strengths, weaknesses and tuning of parameters merit a technical deep dive. The key is calibrating perturbation to maximize analytic integrity while minimizing correlation risk. Ongoing access rather than static publication also allows refinement of data treatment to meet evolving security assumptions and privacy regulations.
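A toy example of value distortion combining bounded random noise with quantization; the noise width and rounding unit are illustrative parameters that would be calibrated against the release's utility and re-identification requirements:

```python
import random

def perturb(values, noise=2.0, unit=5, seed=None):
    """Distort numeric values with bounded noise, then quantize into bins."""
    rng = random.Random(seed)
    out = []
    for v in values:
        v = v + rng.uniform(-noise, noise)   # distort the exact value
        out.append(unit * round(v / unit))   # quantize into coarse bins
    return out

print(perturb([70, 71, 69], seed=0))
```

With these parameters no released value can drift more than the noise bound plus half a bin from its source, which is the kind of property a curator would verify when tuning the treatment.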

Synthetic Data

Synthetic datasets represent an emerging approach for modeling realistic artificial data distributions that resemble an actual patient group without containing the original records. Once the statistical shape of data properties is learned from the genuine dataset, simulated synthetic data can be sampled from generative models that emulate plausible features and relationships without allowing deduction of the real samples.

The underlying models must sufficiently capture multidimensional interactions and representation of minority groups within the patient population. Features such as ethnicity, outcomes, treatments and behaviors must be appropriately represented instead of using simplistic or biased summary statistics that ignore important correlations. Synthetic data techniques applying machine learning and differential privacy mechanisms to reconstruct distributions show significant promise for shared data sandbox environments. Cloud vendors like AWS, Google Cloud and Microsoft Azure now provide synthetic data services.
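To make that pitfall concrete, here is a deliberately naive generator that fits each numeric column independently and samples from a Gaussian. It reproduces marginal distributions but discards the cross-column correlations and minority-group structure a credible generator must model; the example rows are synthetic:

```python
import random
from statistics import mean, stdev

def fit_and_sample(real_rows, n, seed=None):
    """Naive per-column Gaussian sampler: matches each column's marginal
    distribution but ignores correlations between columns."""
    rng = random.Random(seed)
    cols = list(real_rows[0])
    params = {c: (mean(r[c] for r in real_rows),
                  stdev(r[c] for r in real_rows)) for c in cols}
    return [{c: rng.gauss(*params[c]) for c in cols} for _ in range(n)]

real = [{"sbp": 120, "hr": 70}, {"sbp": 140, "hr": 80}, {"sbp": 130, "hr": 75}]
synth = fit_and_sample(real, 5, seed=1)
print(synth[0])
```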

Evaluating the Risk-Utility Tradeoff

Ideally, de-identified health data removes enough identifying risk to prevent adversaries from recognizing individuals while retaining enough fidelity to offer scientific utility for the intended analyses by qualified researchers. But optimizing both privacy protection and analytic value requires navigating technical and ethical nuances around plausible re-identification vulnerabilities and scenarios balanced against access restrictions on derivative insights in the public interest.

Quantitative statistical metrics like k-anonymity models attempt to mathematically define anonymity sets with at least k records containing a combination of quasi-identifiers to avoid isolation. L-diversity metrics further generalize and dilute these groups to limit confidence of guessing the matching identity. Closeness measures how much perturbation may have altered correlations. Quantifying information loss helps data curators shape treatment processes and inclusion of synthetic records. Interpreting these model-based metrics requires understanding their assumptions and limitations with respect to adversary means and background knowledge.
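For instance, k for a released table can be computed directly as the smallest equivalence-class size over the chosen quasi-identifiers (the table below is synthetic):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """k = size of the smallest group of records sharing one combination of
    quasi-identifier values; larger k means individuals blend into a crowd."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

# Synthetic released table: two records share (30-39, 021); one stands alone,
# so the table is only 1-anonymous over this quasi-identifier set.
released = [
    {"age_band": "30-39", "zip3": "021", "dx": "influenza"},
    {"age_band": "30-39", "zip3": "021", "dx": "copd"},
    {"age_band": "40-49", "zip3": "021", "dx": "influenza"},
]
print(k_anonymity(released, ["age_band", "zip3"]))
```

A curator seeing k = 1 would generalize further (say, widening the age bands) until the minimum group size meets the release policy.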

More meaningful measures account for qualitative harms of identity traceability for affected groups based on socioeconomic status, minority populations, immigration factors, substance history, abuse status, disability needs and other cultural contexts that influence vulnerability irrespective of mathematical protections. Trusted access policies should offer options for verifiable rationale when requesting clearer data from data stewards who can evaluate situational sensitivity factors.

Overall responsibility falls upon custodial institutions and data safe havens to conduct contextual integrity assessments ensuring fair data flows to legitimate purposes. This means formally evaluating both welfare impacts on individuals and excluded populations, as well as potential data misuses or manipulative harms at population scale, such as discriminatory profiling. Updated governance mechanisms must address modern re-identification realities and connective threats.

Future Directions

Traditional de-identification practices struggle with handling high-dimensional, heterogeneous patient profiles across accumulating data types, modalities, apps, sensors, workflows and research studies. While valuable on their own, these techniques may fail to fully protect individuals as the ubiquity of digital traces multiplies potential quasi-identifiers. Absolute anonymity also severely limits permissible models and computations.

Emerging areas like federated analytics and differential privacy relax the goal of total de-identification. Federated approaches keep raw records secured on distributed data servers and allow only mathematical summaries to be queried from a central service, so statistical patterns can be discovered across many sites without exposing actual inputs from any one site. Legally defined limited data sets similarly bridge consented data access with managed identity risks for pre-vetted analysts.
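A minimal sketch of the federated pattern: each site reports only a local summary, and the coordinator combines summaries without ever seeing row-level records (the site data below is invented):

```python
# Row-level values stay at their sites; only (sum, count) pairs travel.
site_a = [118, 126, 131]   # e.g., blood pressures held at site A
site_b = [140, 122]        # held at site B; never leaves the site

def local_summary(values):
    """Computed on-site: a summary that reveals no individual record."""
    return (sum(values), len(values))

def federated_mean(summaries):
    """Coordinator combines per-site summaries into a global statistic."""
    total, n = map(sum, zip(*summaries))
    return total / n

print(federated_mean([local_summary(site_a), local_summary(site_b)]))
```

Real deployments add secure aggregation and noise so even the summaries leak nothing about small sites, but the division of labor is the same.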

Differentially private computations introduce mathematically calibrated noise to guarantee that the presence of, and sensitive attributes tied to, any one patient are masked across many patients. This masking allows research insights to be uncovered without revealing individual contributions. Secure multiparty computation and homomorphic encryption also enable certain restricted computations, such as aggregates, means and distributions, to be executed on sensitive inputs while keeping the underlying data encrypted.
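As a concrete instance, the Laplace mechanism for a count query (which has sensitivity 1) adds noise with scale 1/ε; the inverse-CDF sampling below is a standard construction, with ε set by the release's privacy budget:

```python
import math
import random

def dp_count(true_count, epsilon, seed=None):
    """Release a count under epsilon-differential privacy via the Laplace
    mechanism; a count query has sensitivity 1, so the noise scale is 1/eps."""
    rng = random.Random(seed)
    u = rng.random() - 0.5              # uniform draw in [-0.5, 0.5)
    scale = 1.0 / epsilon
    # Inverse-CDF sample from Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

print(dp_count(100, 1.0, seed=42))
```

Smaller ε means stronger privacy and noisier answers; the same budget accounting extends to means and histograms.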

Such cryptographic methods and privacy-enhancing technologies provide complementary assurances to traditional de-identification practices. But governance, interpretation and usability remain active areas of improvement to fulfill ethical promises in practice. Holistic data safe havens must align emerging privacy-preserving computation capabilities with rigorous curation, context-based de-identification protocols and trust-based oversight mechanisms that can demonstrably justify public interest usages while preventing tangible harms to individuals and communities whose sensitive data fuels research.