Key Takeaways:

  • Cardiac procedures are a rich but underutilized data source: Echocardiograms and electrocardiograms (ECGs) generate detailed quantitative and qualitative measures of cardiac structure and function, but these features are often buried in unstructured clinical reports within the Electronic Health Record (EHR) and require deliberate extraction to be usable for research.
  • NLP enables efficient extraction of cardiac phenotypes from EHRs: Vanderbilt University Medical Center (VUMC) researchers demonstrated that natural language processing (NLP) can parse echocardiographic variables from EHR reports accurately and efficiently, offering a scalable, cost-effective approach to studying heart structure and function in diverse populations.
  • Introducing CS Cardiology: NashBio’s Clinical Specialty (CS) Cardiology dataset delivers extracted echocardiogram features, ECG features, and curated biomarkers, supporting the flexible definition and analysis of cardiovascular outcomes across a population of 4M+ individuals in a tabular, analysis-ready format.

From Endpoints to Procedures: A Recap

Our previous blog explored cardiovascular disease endpoints, the clinical events such as myocardial infarction, heart failure, and stroke that serve as the primary outcomes in cardiovascular studies, and how precisely defining these endpoints in large real-world datasets is critical for meaningful research. In a second blog, we explored how genome-wide association studies and polygenic risk scores are illuminating the heritability of cardiovascular disease and enabling genotype-phenotype investigations at scale. Together, these blogs establish that robust cardiovascular disease research requires both well-defined outcomes and genetic context.

To fully characterize the underlying cardiac phenotype, direct measures of cardiac structure and function are also essential. These measurements are obtained through routine clinical procedures that are integral to standard patient care. Two of the most widely utilized and information-rich cardiac assessments are echocardiograms and ECGs. Echocardiograms provide imaging-based quantification of cardiac chamber geometry, valvular morphology, and systolic and diastolic function, while ECGs capture the electrical activity of the heart, including conduction intervals, rhythm abnormalities, and other electrophysiologic features.1,2 Critically, the rich quantitative and qualitative features generated by these procedures are often embedded within unstructured, narrative clinical reports, locked away in the EHR and largely inaccessible for large-scale research without deliberate extraction efforts.3 As the volume of routinely collected clinical data continues to grow, so too does the opportunity to develop scalable methods that can unlock these records for research purposes.

Extraction of Echocardiographic Features with Natural Language Processing

To address the challenges of unstructured and inaccessible EHR data, VUMC researchers developed a method to extract and filter quantitative echocardiographic variables from the EAGLE BioVU cohort, a large, racially diverse EHR-based dataset.4 Reports stored in PDF format within the VUMC EHR were parsed using NLP to extract numeric values for six specific structural parameters: left ventricular septal thickness, left ventricular posterior wall thickness, left ventricular end systolic diameter, left ventricular end diastolic diameter, left atrial diameter, and aortic root diameter.4 Before analysis, the extracted data required systematic filtering because inconsistent measurement units and transcription errors could compromise data integrity.4
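As a rough illustration of the extract-then-filter idea described above, the sketch below pulls the six parameters out of free text with regular expressions, converts millimeter values to centimeters, and flags out-of-range values for manual review. It is not the VUMC pipeline: the label patterns, the field names, the plausibility ranges, and the sample report text are all assumptions made for this example, and real report wording varies widely.

```python
import re

# Hypothetical label patterns for the six parameters; not taken from the paper.
PATTERNS = {
    "lv_septal_thickness": r"septal\s+thickness",
    "lv_posterior_wall_thickness": r"posterior\s+wall\s+thickness",
    "lv_end_systolic_diameter": r"end[- ]systolic\s+diameter",
    "lv_end_diastolic_diameter": r"end[- ]diastolic\s+diameter",
    "left_atrial_diameter": r"left\s+atrial\s+diameter",
    "aortic_root_diameter": r"aortic\s+root\s+diameter",
}

# Illustrative plausibility ranges in cm (assumed, for demonstration only).
PLAUSIBLE_CM = {
    "lv_septal_thickness": (0.3, 3.0),
    "lv_posterior_wall_thickness": (0.3, 3.0),
    "lv_end_systolic_diameter": (1.0, 8.0),
    "lv_end_diastolic_diameter": (2.0, 10.0),
    "left_atrial_diameter": (1.5, 8.0),
    "aortic_root_diameter": (1.5, 6.0),
}

# Optional separator, a number, then an optional unit.
VALUE = r"\s*[:=]?\s*(\d+(?:\.\d+)?)\s*(cm|mm)?"


def extract_measurements(report_text):
    """Return {parameter: {"value_cm": float, "needs_review": bool}}."""
    results = {}
    for name, label in PATTERNS.items():
        match = re.search(label + VALUE, report_text, flags=re.IGNORECASE)
        if match is None:
            continue  # parameter not reported; leave it missing
        value = float(match.group(1))
        unit = (match.group(2) or "").lower()
        if unit == "mm":
            value = value / 10.0  # normalize millimeters to centimeters
        low, high = PLAUSIBLE_CM[name]
        # Values outside the range (e.g. mm recorded without a unit)
        # are flagged for manual review rather than silently corrected.
        results[name] = {
            "value_cm": value,
            "needs_review": not (low <= value <= high),
        }
    return results


# Made-up report text, for illustration only.
sample = (
    "LV septal thickness: 1.1 cm. Posterior wall thickness 10 mm. "
    "Left atrial diameter: 38. Aortic root diameter = 3.2 cm"
)
for parameter, record in extract_measurements(sample).items():
    print(parameter, record)
```

Flagging rather than auto-correcting keeps the manual review step small and explicit, which is the general shape of the filtering approach the paragraph describes.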

The VUMC extraction approach was highly accurate and efficient.4

Of 36,456 potential data points, 91% were present. The post-extraction filtering method required manual review of fewer than 1% of data points, achieving ~90% accuracy in error identification and ~100% sensitivity in error detection.4

The resulting cohort of 2,834 unique adult subjects, predominantly African American, was comparable in size to, and in some cases larger than, established community-based cohorts such as the Atherosclerosis Risk in Communities (ARIC) cohort, the Coronary Artery Risk Development in Young Adults Study (CARDIA), and the Jackson Heart Study (JHS), with similar demographic and echocardiographic characteristics.4

These findings highlight the value of leveraging EHRs as a rich, large-scale data source, demonstrating that clinic-based EHR cohorts can serve as important and complementary resources for epidemiologic and genotype-phenotype studies of cardiac structure and function.4

Introducing CS Cardiology: Deep Cardiac Phenotyping at Population Scale

Building directly on this scientific foundation, NashBio has developed its Clinical Specialty (CS) offering, a suite of enriched, clinically curated datasets that extend the NashBio Structured Clinical Data (NB SCD) with specialty-specific variables designed to enable deep phenotyping. CS Cardiology is a purpose-built dataset for cardiovascular researchers, bringing the rigorous, scalable data extraction and curation methods showcased in the VUMC research article into a research-ready solution. This dataset is designed to empower researchers pursuing questions across the cardiovascular spectrum, from identifying novel genetic determinants of cardiac structure and function, to building predictive models of heart failure and arrhythmia.

CS Cardiology includes:

  • 75+ echocardiographic measures
    Parsed from EHR reports, including left atrial diameter, ejection fraction, chamber dimensions, and wall thickness measurements, all delivered in a research-ready format.
  • 25+ ECG measures
    Including PR interval, QRS duration, QTc interval, and clinical impression, enabling both quantitative and qualitative cardiac electrical phenotyping.
  • 67M+ rows of data
    Including curated laboratory biomarkers with established roles in heart failure risk stratification, atherosclerotic cardiovascular disease, and systemic inflammation, such as B-type natriuretic peptide (BNP), lipoprotein(a) (Lp(a)), and high-sensitivity C-reactive protein (hs-CRP).5-7

What makes CS Cardiology distinct?

  • Data are provided in a tabular format that can be directly linked to the NB SCD, enabling integration with genomic and longitudinal clinical data.
  • CS Cardiology is not restricted to a specific disease cohort. Cardiovascular measures and outcome-relevant data are available across the broader 4M+ NB SCD population, enabling population-level analyses. While availability of specific measures varies based on clinical care patterns, this structure allows researchers to move beyond cohorts defined solely by diagnosed cardiovascular disease and instead examine subclinical variation, risk factor trajectories, and disease incidence across a large real-world population.
  • Curated cardiology measurements accompany the extracted features, reducing the preprocessing burden on research teams.

Contact NashBio today to learn how CS Cardiology can accelerate your cardiovascular breakthrough.

References

  1. Taub CC, Stainback RF, Abraham T, et al. Guidelines for the Standardization of Adult Echocardiography Reporting: Recommendations From the American Society of Echocardiography. J Am Soc Echocardiogr. 2025;38(9):735-774. doi:10.1016/j.echo.2025.06.001
  2. Sattar Y, Chhabra L. Electrocardiogram. In: StatPearls. StatPearls Publishing; 2026.
  3. Reading Turchioe M, Volodarskiy A, Pathak J, Wright DN, Tcheng JE, Slotwiner D. Systematic review of current natural language processing methods and applications in cardiology. Heart. 2022;108(12):909-916. doi:10.1136/heartjnl-2021-319769
  4. Wells QS, Farber-Eger E, Crawford DC. Extraction of echocardiographic data from the electronic medical record is a rapid and efficient method for study of cardiac structure and function. J Clin Bioinforma. 2014;4:12. doi:10.1186/2043-9113-4-12
  5. Cao Z, Jia Y, Zhu B. BNP and NT-proBNP as Diagnostic Biomarkers for Cardiac Dysfunction in Both Clinical and Forensic Medicine. Int J Mol Sci. 2019;20(8). doi:10.3390/ijms20081820
  6. Reyes-Soffer G, Ginsberg HN, Berglund L, et al. Lipoprotein(a): A Genetically Determined, Causal, and Prevalent Risk Factor for Atherosclerotic Cardiovascular Disease: A Scientific Statement From the American Heart Association. Arterioscler Thromb Vasc Biol. 2022;42(1):e48-e60. doi:10.1161/ATV.0000000000000147
  7. Kurt B, Reugels M, Schneider KM, et al. C-reactive protein and cardiovascular risk in the general population. Eur Heart J. Published online December 11, 2025:ehaf937. doi:10.1093/eurheartj/ehaf937