Identifying and Understanding Data Quality Issues in a Pediatric Distributed Research Network

Sunday, October 25, 2015: 11:30 AM
143B-C (Walter E. Washington Convention Center)
Levon Haig Utidjian, MD (1), Ritu Khare, PhD (1), Evanette K. Burrows, BS (1), Greg Schulte, MS (2), Sara J. Deakyne, MPH (3), Keith Marsolo, PhD (4), Megan M. Reynolds, BBA (5), Richard R. Hoyt, BS (5), Nandan Patibandla, MS (6) and L. Charles Bailey, MD, PhD (1)
(1) The Children's Hospital of Philadelphia, Philadelphia, PA; (2) Children's Hospital Colorado, Aurora, CO; (3) Pediatric Emergency Medicine, Children's Hospital Colorado, Aurora, CO; (4) Cincinnati Children's Hospital Medical Center, Cincinnati, OH; (5) Nationwide Children's Hospital, Columbus, OH; (6) Boston Children's Hospital, Boston, MA


Collaborations across multiple institutions are essential to achieving adequate cohort sizes in pediatric research. PEDSnet is a new clinical data research network (CDRN) that aggregates electronic health record (EHR) data from eight of the nation's largest children's hospitals. Before PEDSnet can support comparative effectiveness research, the network's data must be shown to be of high quality. The main challenges are that EHR data are not collected with research use in mind, that semantics are heterogeneous across source systems, and that pediatric health data have peculiarities of their own. We demonstrate the use of a comprehensive set of validity checks to identify, understand, and report a range of data quality issues (Table 1) in PEDSnet.


PEDSnet uses the OMOP Common Data Model (CDM), a widely accepted schema for observational medical data. Each partner site performed extract-transform-load (ETL) operations according to network-wide conventions. We focused on attribute-level data quality assessment and developed data analysis scripts that verify adherence to the CDM, perform data domain checks, and compute frequency distributions. We ran the scripts on each site's data and recorded the data quality issues they revealed. Each issue was classified as either an "ETL issue," denoting an ETL logic error, or a "provenance issue," denoting a pediatric data anomaly or data entry error.
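The three kinds of attribute-level checks named above (CDM adherence, domain checks, frequency distributions) might look roughly like the following. This is a minimal sketch, not the scripts used in the study: the field names, the allowed gender concept set, the weight bounds, and the toy rows are all assumptions chosen for illustration.

```python
from collections import Counter

# All names and thresholds below are illustrative assumptions,
# not the PEDSnet conventions or the study's actual scripts.
REQUIRED_FIELDS = ("person_id", "gender_concept_id", "weight_kg")
ALLOWED_GENDER = {8507, 8532}      # assumed allowed concept IDs
WEIGHT_RANGE_KG = (0.2, 300.0)     # assumed plausible bounds


def check_rows(rows):
    """Return (row_index, field, problem) tuples for each failed check."""
    low, high = WEIGHT_RANGE_KG
    issues = []
    for i, row in enumerate(rows):
        # Schema adherence: every required field is present and non-null.
        for field in REQUIRED_FIELDS:
            if row.get(field) is None:
                issues.append((i, field, "missing"))
        # Domain check on a coded attribute.
        gender = row.get("gender_concept_id")
        if gender is not None and gender not in ALLOWED_GENDER:
            issues.append((i, "gender_concept_id", "not in allowed set"))
        # Domain check on a numeric attribute.
        weight = row.get("weight_kg")
        if weight is not None and not (low <= weight <= high):
            issues.append((i, "weight_kg", "out of range"))
    return issues


def frequency(rows, field):
    """Frequency distribution of one attribute, counting nulls as None."""
    return Counter(row.get(field) for row in rows)


# Toy rows, invented for this example only.
rows = [
    {"person_id": 1, "gender_concept_id": 8507, "weight_kg": 12.4},
    {"person_id": 2, "gender_concept_id": 0, "weight_kg": 3500.0},
    {"person_id": 3, "gender_concept_id": 8532, "weight_kg": None},
]
issues = check_rows(rows)
gender_freq = frequency(rows, "gender_concept_id")
```

In this toy input, the second row fails both domain checks (a weight of 3500 would be consistent with grams stored in a kilogram field) and the third row is flagged for a missing weight. Whether such a finding is an ETL issue or a provenance issue still has to be decided by reviewing the source data.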


To date, we have collected data from six sites, representing 3.6 million children and over 75 million encounters, and constituting over 80% of the final network. Table 2 reports the total number of issues by data quality dimension. Issues discovered by our methods include variations in gestational age and significant differences in vital sign data such as weight and blood pressure measurements.
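A tally of issues by cause and data quality dimension, in the spirit of Table 2, could be produced along these lines. This is a sketch under stated assumptions: the layout (causes as rows, the four Table 1 dimensions as columns) is inferred from the text, and the issue list is invented toy data, not study results.

```python
from collections import Counter

# Assumed layout: rows are issue causes, columns are the Table 1 dimensions.
DIMENSIONS = ("fidelity", "consistency", "accuracy", "completeness")
CAUSES = ("ETL issue", "provenance issue")


def tabulate(issues):
    """Count (cause, dimension) pairs and return one row of counts per cause."""
    counts = Counter(issues)
    return {
        cause: [counts[(cause, dim)] for dim in DIMENSIONS]
        for cause in CAUSES
    }


# Toy issue list, invented for this example; not the study's findings.
toy_issues = [
    ("ETL issue", "fidelity"),
    ("ETL issue", "completeness"),
    ("provenance issue", "accuracy"),
    ("ETL issue", "fidelity"),
]
table = tabulate(toy_issues)
for cause in CAUSES:
    print(cause, table[cause])
```

Keeping each issue as a (cause, dimension) pair means the same list can be re-tabulated per site or per clinical domain by adding a field, without changing how issues are recorded.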


A key challenge in building any CDRN is defining and achieving an optimal degree of data quality. Despite defining network-wide conventions, sharing ETL scripts, and organizing regular web conferences, we identified over 200 data quality issues across six sites. This strongly suggests that proactive project management and documentation alone are not sufficient to ensure data validity in a CDRN. Our future work includes operationalizing advanced assessment levels, extending the assessment to all partner sites and more clinical domains, and reporting longitudinal results.

Table 1. Data Quality Dimensions

Fidelity (i.e., reliability): the degree to which PEDSnet data correctly reflects source system data

Consistency (i.e., internal validity): the degree to which a specific type of information in PEDSnet is recorded in the same way across different sources

Accuracy (i.e., external validity): the degree to which PEDSnet data accurately reflects patient clinical characteristics

Completeness (i.e., feasibility): the degree to which a piece of information is collected and available in PEDSnet

Table 2. Total Number of Data Quality Issues

ETL issue

Provenance issue