Key Takeaways
- Oncology cohort definitions often require more specificity than diagnosis coding alone can provide. For example, International Classification of Diseases, Clinical Modification (ICD-CM) codes fail to distinguish lung cancer subtypes or differentiate suspected disease from clinically confirmed malignancy.
- The NashBio Structured Clinical Data (SCD), combined with the Clinical Specialty (CS) Oncology dataset, provides a launch point for cohort definition. Researchers can use the SCD to assess feasibility and define broad oncology cohorts prior to downstream refinement using other linked clinical data.
- Cancer registry data from Vanderbilt University Medical Center, included in CS Oncology, provides verified histology, staging, and grading details to enable subtype-specific discovery.
- NashBio’s oncology data is refreshed biannually and linked to the broader SCD, enabling researchers to analyze biopsy-confirmed diagnoses alongside long-term outcomes and treatment trajectories.
Introduction
Oncology research increasingly depends on large-scale real-world data (RWD) to study disease mechanisms, identify therapeutic targets, and evaluate treatment outcomes. Lung cancer remains the leading cause of cancer-related mortality in 2026 (Siegel et al., 2026) and presents a particular challenge for real-world research because broad diagnostic labels often fail to capture the underlying histologic and molecular heterogeneity of disease (Zito Marino et al., 2019). Lung cancer includes multiple biologically distinct entities with different molecular drivers, treatment approaches, and clinical outcomes. As a result, many oncology research questions depend on accurately distinguishing lung cancer subtypes (Zito Marino et al., 2019).
However, many RWD sources were not originally designed for subtype-specific oncology research (Beyrer et al., 2023). For example, International Classification of Diseases, Clinical Modification (ICD-CM) codes used for billing do not reliably distinguish histologic subtypes or identify registry-confirmed cancer cases (Hess et al., 2019). Without access to more clinically specific disease characterization, oncology cohorts may include heterogeneity that can affect subgroup analyses, treatment comparisons, and interpretation of outcomes.
In this post, we examine how curated oncology registry data and histologic confirmation available through NashBio can support more clinically specific oncology cohort definitions and longitudinal real-world research.
The Clinical Challenge: When Lung Cancer is Not Specific Enough
The distinction between lung cancer subtypes is clinically consequential. Small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) differ fundamentally in their biology, oncogenic drivers, treatment paradigms, and survival trajectories (Han et al., 2025). Within NSCLC, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) harbor distinct mutational landscapes, with actionable targets such as Epidermal Growth Factor Receptor (EGFR) mutations and Anaplastic Lymphoma Kinase (ALK) rearrangements (Chevallier et al., 2021). More recently, genome-wide association studies have suggested that the genetic architectures of LUAD, LUSC, and SCLC are also largely distinct (Long et al., 2022). Aggregating these subtypes into broad diagnosis categories may introduce clinical and biological heterogeneity into research cohorts.
Standard RWD sources, including those derived from electronic health records (EHR), frequently rely on broad administrative diagnosis codes for patient identification (Blonde et al., 2018). This creates important limitations for researchers seeking to evaluate subtype-specific biomarkers, treatment responses, or outcomes.
ICD-10-CM codes for lung cancer describe anatomic site but not histology, yet subtype distinctions are essential for precision oncology research: genomic risk architectures differ across LUAD, LUSC, and SCLC (Long et al., 2022). Distinguishing these subtypes requires ICD-O-3 morphology codes or access to pathology results.
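To make the contrast with site-only coding concrete, the following is a minimal sketch of how ICD-O-3 morphology codes could be mapped to coarse subtype labels. The lookup table is illustrative and deliberately incomplete (three "not otherwise specified" codes only), and the function name and labels are assumptions made for this example, not part of any NashBio schema.

```python
# Illustrative, incomplete lookup: a real study would use a full, clinically reviewed code list.
MORPHOLOGY_TO_SUBTYPE = {
    "8140": "LUAD",  # adenocarcinoma, NOS
    "8070": "LUSC",  # squamous cell carcinoma, NOS
    "8041": "SCLC",  # small cell carcinoma, NOS
}


def subtype_from_morphology(code: str) -> str:
    """Map an ICD-O-3 morphology code such as '8140/3' to a coarse subtype label."""
    histology, _, behavior = code.strip().partition("/")
    # Behavior digit 3 means malignant, primary site; anything else is set aside here.
    if behavior != "3":
        return "excluded"
    return MORPHOLOGY_TO_SUBTYPE.get(histology, "unmapped")


for example in ["8140/3", "8070/3", "8041/3", "8070/2"]:
    print(example, "->", subtype_from_morphology(example))
```

Codes that fall outside the lookup return "unmapped" rather than being forced into a subtype, which keeps ambiguous histologies visible for manual review.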
Scoping the Cohort: The NashBio SCD as a Launch Point
The NashBio Structured Clinical Data (SCD) can serve as a useful starting point for oncology cohort definition. Built from de-identified EHR data from Vanderbilt University Medical Center (VUMC), the SCD contains structured coding for diagnoses, procedures, medications, labs, vitals, and encounter data. This allows researchers to query and explore patient populations using broad demographic and diagnostic criteria to assess feasibility and refine cohort definitions before incorporating additional oncology-specific data sources. In an oncology setting, researchers can first define a broader lung cancer population before refining the cohort using histologic confirmation and curated registry data.
- The NashBio SCD enables rapid, iterative cohort exploration using demographic and diagnostic parameters before resource-intensive data retrieval begins. This scoping step helps allocate research resources efficiently and improves the yield of downstream pathological confirmation.
As of this post, the NashBio SCD contains ~13K patients with at least one ICD-10-CM diagnosis code of C34: Malignant neoplasm of bronchus and lung.
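A count like this comes from a simple "at least one code under C34" rule. The sketch below shows one way such a broad cohort could be selected from diagnosis rows. The (patient_id, icd10cm_code) row layout, the function names, and the toy data are all hypothetical and do not reflect the actual SCD schema.

```python
from typing import Iterable, Set, Tuple


def normalize_icd10cm(code: str) -> str:
    """Uppercase and drop the dot so 'c34.90' and 'C3490' compare equal."""
    return code.strip().upper().replace(".", "")


def patients_with_code_prefix(
    diagnoses: Iterable[Tuple[str, str]], prefix: str = "C34"
) -> Set[str]:
    """Return patient IDs with at least one diagnosis code under `prefix`."""
    target = normalize_icd10cm(prefix)
    return {
        patient_id
        for patient_id, code in diagnoses
        if normalize_icd10cm(code).startswith(target)
    }


# Toy rows in a hypothetical (patient_id, icd10cm_code) layout.
diagnoses = [
    ("P001", "C34.90"),   # lung, unspecified part
    ("P001", "I10"),      # hypertension
    ("P002", "c34.11"),   # lung, upper lobe, right (lowercase on purpose)
    ("P003", "C50.911"),  # breast, not lung
    ("P004", "J44.9"),    # COPD
]

broad_cohort = patients_with_code_prefix(diagnoses)
print(sorted(broad_cohort))
```

Because the rule only looks at anatomic site, this cohort is intentionally broad: it says nothing yet about histology or whether the diagnosis was pathologically confirmed.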
Defining the Cohort: Leveraging Biopsy-Proven Registry Data
Where the SCD supports broad cohort identification, NashBio’s integration of cancer registry data, available with the Clinical Specialty (CS) Oncology dataset, adds clinically curated histology, staging, and disease characterization. CS Oncology enriches longitudinal EHR records with expert-abstracted cancer registry data to reveal the clinical signals driving disease progression and treatment response. This registry data is generated through oncology care delivered at Vanderbilt-Ingram Cancer Center (VICC), a National Cancer Institute (NCI)-designated Comprehensive Cancer Center.
By accessing this registry layer, researchers gain additional clinical specificity through:
- Verified histology: confirmation of tumor histology based on pathology review and biopsy findings
- Pathological staging: precise TNM (Tumor, Node, Metastasis) classification at diagnosis based on surgical pathology and clinical evaluation
- Tumor grade: assessment of cellular differentiation and abnormality associated with tumor aggressiveness and metastatic behavior
Together, these data elements support a more clinically specific phenotype than diagnosis coding alone.
- Supplementing the NashBio SCD with cancer registry data generated from an NCI-designated Comprehensive Cancer Center adds clinically curated histology, staging, and grading information not consistently captured through administrative coding alone.
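As a rough illustration of this refinement step, the sketch below narrows a broad diagnosis-code cohort to registry-confirmed cases of a single subtype and summarizes their stage at diagnosis. The `RegistryRecord` fields, the `refine_cohort` helper, and the toy data are assumptions made for this example, not the CS Oncology data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class RegistryRecord:
    """Hypothetical registry row: one abstracted tumor record per patient."""
    patient_id: str
    subtype: str      # e.g. "LUAD", "LUSC", "SCLC"
    stage_group: str  # e.g. "I", "II", "III", "IV"


def refine_cohort(
    broad_cohort: Set[str], registry: Iterable[RegistryRecord], subtype: str
) -> Set[str]:
    """Keep broad-cohort patients whose registry record confirms `subtype`."""
    return {
        r.patient_id
        for r in registry
        if r.patient_id in broad_cohort and r.subtype == subtype
    }


# Toy inputs: a broad code-based cohort and a small registry extract.
broad_cohort = {"P001", "P002", "P003", "P004"}
registry = [
    RegistryRecord("P001", "LUAD", "IV"),
    RegistryRecord("P002", "SCLC", "III"),
    RegistryRecord("P003", "LUAD", "I"),
    RegistryRecord("P009", "LUAD", "II"),  # registry case outside the broad cohort
]

luad_cohort = refine_cohort(broad_cohort, registry, "LUAD")
# Patients selected by diagnosis code but with no registry record at all.
unconfirmed = broad_cohort - {r.patient_id for r in registry}
# Stage distribution within the refined cohort.
stage_counts = Counter(r.stage_group for r in registry if r.patient_id in luad_cohort)

print(sorted(luad_cohort), sorted(unconfirmed), dict(stage_counts))
```

Tracking the unconfirmed set separately makes it easy to report how many code-based cases lack registry confirmation instead of silently dropping them.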
Looking Ahead: Longitudinal Tracking and the Patient Journey
NashBio addresses limitations associated with diagnosis-code-based cohort definitions in oncology research by supplementing EHR-derived SCD with curated cancer registry data. The dataset is refreshed biannually, enabling researchers to evaluate histologically characterized oncology cohorts alongside treatments, laboratory results, imaging, comorbidities, and clinical outcomes over time.
In our next post, we will explore how these data sources can support reconstruction of the patient journey from pre-diagnostic evaluation through treatment and follow-up.
References
- Beyrer, J., Nelson, D.R., Sheffield, K.M., Huang, Y.-J., Lau, Y.-K., Hincapie, A.L., 2023. Development and Validation of Coding Algorithms to Identify Patients with Incident Non-Small Cell Lung Cancer in United States Healthcare Claims Data. Clin Epidemiol 15, 73–89. https://doi.org/10.2147/CLEP.S389824
- Blonde, L., Khunti, K., Harris, S.B., Meizinger, C., Skolnik, N.S., 2018. Interpretation and Impact of Real-World Clinical Data for the Practicing Clinician. Adv Ther 35, 1763–1774. https://doi.org/10.1007/s12325-018-0805-y
- Chevallier, M., Borgeaud, M., Addeo, A., Friedlaender, A., 2021. Oncogenic driver mutations in non-small cell lung cancer: Past, present and future. World J Clin Oncol 12, 217–237. https://doi.org/10.5306/wjco.v12.i4.217
- Han, J., Yang, Z., Zhao, H., 2025. Lung cancer immunotherapy in 2025: where we stand and what comes next? Front Immunol 16, 1728163. https://doi.org/10.3389/fimmu.2025.1728163
- Hess, L.M., Zhu, Y.E., Sugihara, T., Fang, Y., Collins, N., Nicol, S., 2019. Challenges of Using ICD-9-CM and ICD-10-CM Codes for Soft-Tissue Sarcoma in Databases for Health Services Research. Perspect Health Inf Manag 16, 1a.
- Long, E., Patel, H., Byun, J., Amos, C.I., Choi, J., 2022. Functional studies of lung cancer GWAS beyond association. Hum Mol Genet 31, R22–R36. https://doi.org/10.1093/hmg/ddac140
- Siegel, R.L., Kratzer, T.B., Wagle, N.S., Sung, H., Jemal, A., 2026. Cancer statistics, 2026. CA Cancer J Clin 76, e70043. https://doi.org/10.3322/caac.70043
- Zito Marino, F., Bianco, R., Accardo, M., Ronchi, A., Cozzolino, I., Morgillo, F., Rossi, G., Franco, R., 2019. Molecular heterogeneity in lung cancer: from mechanisms of origin to clinical implications. Int J Med Sci 16, 981–989. https://doi.org/10.7150/ijms.34739
