Key Takeaways:

  • Persistent data fragmentation and lack of interoperability remain key barriers to realizing the full value of real-world clinical data.
  • Privacy-preserving linkage methods, such as tokenization, support secure integration of de-identified data across multiple sources while minimizing privacy risks.
  • Adopting common data models and standards improves quality, portability, and fit-for-purpose use of real-world clinical data.

Introduction

The integration of real-world clinical data enables more robust analyses to inform clinical, regulatory, and health policy decisions.1 Despite ongoing industry investment in interoperability standards and data linkage technologies, aggregating real-world clinical data from multiple sources remains a significant challenge. A collaborative effort among stakeholders is needed to build scalable solutions that ensure high data quality, security, and usability. Addressing these gaps will require pragmatic, standards-driven approaches that include coordinated governance and methodological transparency.

Data Incompatibility: Fragmented Sources. Real-world clinical data originates from diverse sources, including electronic health records, pharmacy and laboratory systems, insurance claims, and disease registries. These sources often exhibit data quality challenges such as missing or illogical timestamps, misaligned clinical codes, gaps in longitudinal follow-up, and limited validation of key variables.2 Moreover, critical information often remains buried in scanned documents, complicating automated data access and analysis. These issues reflect deeper fragmentation in how health systems document, store, and govern clinical information.

To address these issues, data providers, industry researchers, and research partners should:

  • Implement consistent documentation practices across data sources to reduce downstream complexity and improve the reliability of real-world analyses.
  • Utilize common data models (OMOP, PCORnet CDM) and standards (CDISC, HL7 v2/v3, FHIR, and C-CDA) to harmonize data content storage, packaging, and exchange.3
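As an illustration of the data quality problems noted above (missing or illogical timestamps, absent clinical codes), the following is a minimal sketch of record-level checks that could run before data are harmonized. The field names (`birth_date`, `admit_date`, `discharge_date`, `diagnosis_code`) and the `check_encounter` function are hypothetical and are not drawn from any particular common data model.

```python
from datetime import date


def check_encounter(record: dict) -> list[str]:
    """Return a list of data quality issues found in one encounter record.

    All field names are hypothetical placeholders for illustration only.
    """
    issues = []
    birth = record.get("birth_date")
    admit = record.get("admit_date")
    discharge = record.get("discharge_date")

    # Missing timestamps
    if admit is None:
        issues.append("missing admit_date")
    if discharge is None:
        issues.append("missing discharge_date")

    # Illogical timestamps
    if admit and discharge and discharge < admit:
        issues.append("discharge_date before admit_date")
    if admit and birth and admit < birth:
        issues.append("admit_date before birth_date")

    # Missing clinical code
    if not record.get("diagnosis_code"):
        issues.append("missing diagnosis_code")

    return issues


example = {
    "birth_date": date(1980, 5, 1),
    "admit_date": date(2023, 3, 10),
    "discharge_date": date(2023, 3, 8),  # earlier than admission
    "diagnosis_code": "",
}
print(check_encounter(example))
```

Checks of this kind do not repair the data; they make gaps visible so that they can be documented and handled consistently across sources.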

Balancing Privacy with Linkage. Traditional approaches to linking participant-level data rely on direct linkage: they require either sharing direct identifiers (personally identifying information) or a common identifier that is available across data sources, which is often not feasible. Privacy-preserving linkage methods, such as tokenization, enhance data completeness by enabling de-identified data integration across sources, supporting longitudinal analysis while minimizing privacy risks.4,5 With tokenization, third-party tools generate irreversible tokens from combinations of personally identifying information, such as first name + last name + gender + date of birth.6 Although tokenization reduces the need to share direct identifiers, its accuracy varies with data quality and source coverage. It also does not address underlying gaps in data completeness and may underperform in populations whose records are less consistent or more fragmented.
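The general idea behind tokenization can be sketched as a keyed one-way hash over normalized identifiers. The example below is an assumption-laden illustration, not the method used by any specific tokenization vendor: the `normalize` and `make_token` functions, the field order, the `|` separator, and the choice of HMAC-SHA256 are all hypothetical choices made for this sketch.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Remove accents, whitespace, and case differences so that trivial
    formatting variation does not change the resulting token."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(without_accents.split()).upper()


def make_token(first_name: str, last_name: str, gender: str,
               date_of_birth: str, secret_key: bytes) -> str:
    """Build a token from first name + last name + gender + date of birth.

    HMAC-SHA256 is an assumption for this sketch: it is one-way, and the
    secret key keeps tokens from being recomputed by parties without it.
    """
    message = "|".join(
        normalize(part)
        for part in (first_name, last_name, gender, date_of_birth)
    )
    return hmac.new(secret_key, message.encode("utf-8"),
                    hashlib.sha256).hexdigest()


key = b"example-key-for-illustration-only"  # hypothetical; never hard-code real keys
token_a = make_token("José", "Garcia ", "m", "1980-01-02", key)
token_b = make_token("jose", "GARCIA", "M", "1980-01-02", key)
print(token_a == token_b)  # same person, different formatting
```

Because the token is derived from the input fields, a misspelled name or an incorrect date of birth in one source yields a different token, which is one reason accuracy depends on data quality.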

To ensure accuracy and integrity with tokenization, transparent documentation of data provenance, linkage methods, and quality checks is essential, because the choice of linkage approach, whether deterministic (exact matches) or probabilistic (partial matches), affects accuracy.4,5 Looking ahead, concept-stage models combining blockchain technology and self-sovereign identity have been proposed to support tokenization and privacy management within Health Information Exchanges.7 While promising, these approaches remain at an early stage and require further evaluation before widespread real-world implementation.
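The difference between deterministic and probabilistic linkage can be illustrated with a simplified sketch. For readability it compares plain-text fields; in a tokenized workflow the comparison would instead be made on tokens, often several tokens built from different field combinations. The field weights, the 0.9 threshold, and the function names are hypothetical and are not taken from the cited sources, and the weighted similarity score is a simplification rather than a full probabilistic linkage model.

```python
from difflib import SequenceMatcher

FIELDS = ("first_name", "last_name", "gender", "date_of_birth")

# Hypothetical weights reflecting how much each field contributes to a match.
WEIGHTS = {
    "first_name": 0.25,
    "last_name": 0.35,
    "gender": 0.10,
    "date_of_birth": 0.30,
}


def _clean(value: str) -> str:
    return value.strip().lower()


def deterministic_match(a: dict, b: dict) -> bool:
    """Exact match: every field must agree after basic cleaning."""
    return all(_clean(a[f]) == _clean(b[f]) for f in FIELDS)


def probabilistic_score(a: dict, b: dict) -> float:
    """Weighted sum of per-field string similarity, between 0 and 1."""
    return sum(
        WEIGHTS[f] * SequenceMatcher(None, _clean(a[f]), _clean(b[f])).ratio()
        for f in FIELDS
    )


def probabilistic_match(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Partial match: accept the pair if the score reaches the threshold."""
    return probabilistic_score(a, b) >= threshold


rec_a = {"first_name": "Katherine", "last_name": "Smith",
         "gender": "F", "date_of_birth": "1975-03-14"}
rec_b = {"first_name": "Katharine", "last_name": "Smith",
         "gender": "F", "date_of_birth": "1975-03-14"}

print(deterministic_match(rec_a, rec_b))   # spelling variant breaks the exact match
print(probabilistic_match(rec_a, rec_b))   # partial agreement may still link the pair
```

The threshold controls the trade-off: a lower value links more true pairs but also admits more false links, which is why the chosen approach and its parameters should be documented alongside the results.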

Conclusion

Aggregated real-world clinical data enables scalable, fit-for-purpose applications across research, analytics, and life sciences. Achieving this requires standardized data models, interoperable frameworks, and privacy-preserving linkage methods such as tokenization. Ultimately, reducing fragmentation in real-world clinical data will require sustained collaboration across data providers, research partners, and industry stakeholders.

References

  1. Penberthy LT, Rivera DR, Lund JL, Bruno MA, Meyer AM. An overview of real-world data sources for oncology and considerations for research. CA Cancer J Clin. 2022;72(3):287-300. doi:10.3322/caac.21714
  2. Pfaffenlehner M, Behrens M, Zöller D, et al. Methodological challenges using routine clinical care data for real-world evidence: a rapid review utilizing a systematic literature search and focus group discussion. BMC Med Res Methodol. 2025;25(1):8. doi:10.1186/s12874-024-02440-x
  3. Finster M, Wenzel M, Taghizadeh E. Common data models and data standards for tabular health data: a systematic review. BMC Med Inform Decis Mak. 2025;25(1):422. doi:10.1186/s12911-025-03267-2
  4. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER), and Oncology Center of Excellence (OCE). Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products. Guidance for Industry. July 25, 2024. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory
  5. Bernstam EV, Applegate RJ, Yu A, et al. Real-World Matching Performance of Deidentified Record-Linking Tokens. Appl Clin Inform. 2022;13(4):865-873. doi:10.1055/a-1910-4154
  6. Walters C, Langlais CS, Oakkar EE, Hoogendoorn WE, Coutcher JB, Van Zandt M. Implementing tokenization in clinical research to expand real-world insights. Front Drug Saf Regul. 2025;5:1519307. doi:10.3389/fdsfr.2025.1519307
  7. Zhuang Y, Shyu CR, Hong S, Li P, Zhang L. Self-sovereign identity empowered non-fungible patient tokenization for health information exchange using blockchain technology. Comput Biol Med. 2023;157:106778. doi:10.1016/j.compbiomed.2023.106778