Quality
Christina M. Stuart, MD (she/her/hers)
Resident Physician
University of Colorado, Department of Surgery
Denver, Colorado, United States
Yizhou Fei, MS
Analyst
University of Colorado, Anschutz Medical Campus, United States
Richard D. Schulick, MD, MBA
Chair of Surgery and Cancer Center Director
University of Colorado, Department of Surgery, United States
Kathryn L. Colborn, PhD
Associate Professor
University of Colorado School of Medicine, Adult & Child Center for Outcomes Research & Delivery Science, United States
Robert A. Meguid, MD, MPH
Professor of Surgery
University of Colorado, Department of Surgery, United States
Data quality, including the completeness of data elements, is a major consideration when working with data registries to generate clinical insights. Notably, among patients with cancer identified through the National Cancer Database (NCDB), the prevalence of missing data has been associated with differences in overall survival and therefore has marked implications for clinical care and research. The objective of this study was to enhance the manually abstracted NCDB by decreasing rates of missing data for key variables and adding new variables using automated statistical methodology.
Methods: A health system’s NCDB data for patients with primary colorectal, lung, and pancreatic cancers from 2011 to 2021 was linked to electronic health record (EHR) data using personal health identifiers. Variables with frequent missingness (race, ethnicity, height, weight, and smoking status) and new variables (Eastern Cooperative Oncology Group (ECOG) score, American Society of Anesthesiologists Physical Status Classification (ASA class), functional health status, chemotherapy regimen, and surgical procedure) were identified in structured and unstructured EHR data. After the structured EHR data were incorporated, a natural language processing tool based on rule-based algorithms was designed to extract the remaining variables from unstructured notes. The rule-based algorithms were written in the R programming language using regular expressions.
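The rule-based extraction described above can be illustrated with a minimal sketch. The study's algorithms were written in R and their actual rules are not given here; the Python example below is only an assumption-laden illustration of the general technique. The pattern `ECOG_PATTERN` and the helper names `extract_ecog` and `fill_missing` are hypothetical, and the phrasings matched are guesses at how an ECOG score might appear in a clinical note.

```python
import re

# Hypothetical pattern: the phrasings matched here are illustrative
# assumptions, not the rules used in the study.
ECOG_PATTERN = re.compile(
    r"\bECOG(?:\s+(?:performance\s+status|PS|score))?"  # optional qualifier
    r"\s*(?:is|of|was|[:=])?\s*"                         # optional connector
    r"([0-4])\b",                                        # score captured as one digit, 0-4
    re.IGNORECASE,
)


def extract_ecog(note):
    """Return the first ECOG score (0-4) found in a free-text note, else None."""
    match = ECOG_PATTERN.search(note)
    return int(match.group(1)) if match else None


def fill_missing(structured_value, note):
    """Prefer the structured EHR value; fall back to the note only if it is missing."""
    if structured_value is not None:
        return structured_value
    return extract_ecog(note)


if __name__ == "__main__":
    print(extract_ecog("Seen in clinic today. ECOG performance status: 1."))
    print(fill_missing(None, "ECOG PS of 0, tolerating therapy."))
```

The fallback order in `fill_missing` mirrors the sequence stated in the methods, where structured data are incorporated first and unstructured notes are searched afterward. A production rule set would likely need many more phrasings, plus handling of negation and ranges.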
Results: A total of 6,050 patients with NCDB records were linked to their EHR data. Prior to enhancement, rates of missingness for key demographic variables ranged from 2.0% to 5.3% (see figure). Following dataset enhancement, missingness was significantly reduced across all variables, ranging from a 31.9% relative reduction in missingness for height to a 68.0% relative reduction for smoking status. Of the new variables added, 1,367 patients (22.6%) gained an ECOG score, 81 (20.8%) gained an ASA class, 1,099 (57.8%) gained a chemotherapy regimen, and 979 (32.8%) gained a surgical procedure.
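For readers less familiar with the metric, a relative reduction in missingness is conventionally the drop in missing values divided by the number missing before enhancement. The sketch below assumes that conventional definition; the abstract does not state the exact formula used. The function name `relative_reduction` and the counts in the example are hypothetical and are not study data.

```python
def relative_reduction(missing_before, missing_after):
    """Percent relative reduction in missing values: 100 * (before - after) / before."""
    if missing_before <= 0:
        raise ValueError("missing_before must be positive")
    return 100.0 * (missing_before - missing_after) / missing_before


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not taken from the study):
    # 200 records missing a value before enhancement, 50 afterward.
    print(relative_reduction(200, 50))
```

Because the denominator is the pre-enhancement count, a variable with low baseline missingness can show a large relative reduction even when few records change in absolute terms.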
Conclusions: By applying statistical methodology to merged data, we were able to reduce rates of missingness in existing variables and add new variables to enrich the NCDB. While further refinement is needed to decrease missingness in new variables, this automated methodology can replace or augment manual chart review and improve the utility of the NCDB for studying unanswered questions, leading to clinical advancements in oncology.