Data Deep Dive

United States Covid-19 Line List

Data up to August 26th, 2021
Completeness *$
cdc_case_earliest_dt (see explanation below): 100%
age_group: 99.9997%
race_ethnicity_combined: 99.99994%
Onset_dt (Date of symptom onset): 46.3%
Icu_yn (ICU admission): 100%
Death_yn (Did the patient die as a result of this illness): 100%
Medcond_yn (Pre-existing medical conditions): 45.8%

* as of 26/08/2021;

 $ line list includes 67% of cases reported for the USA by the World Health Organization; completeness metrics computed for all cases, including those classed as probable.

In the following case study we take a deep dive into COVID-19 line-list data from the USA, one of more than 130 countries included in the Global.health platform. The case study covers the provenance of the data, the transformations applied to fit the Global.health schema, and the key characteristics and limitations of the data. Although it addresses only one country, the design of Global.health lets users quickly ask similar questions about any country included in the platform. We will discuss how users can conduct such an investigation on their own. Stay tuned for more of these data deep dives. Here are other examples for Peru, Colombia, and Brazil.

 

The epidemic dynamics in the United States varied considerably due to differences in reporting, transmission dynamics and availability of testing. The individual-level database described here was an attempt by the public health authority to consolidate this varied reporting into one standardised schema. It allowed modelling across counties and states and was used to estimate the health burden of SARS-CoV-2 in the US (https://www.nature.com/articles/s41586-021-03914-4). Further, it is the largest database of individual-level case data in the world (currently > 25M cases).

01.
What is the provenance of the data?

The Centers for Disease Control and Prevention (CDC), through the CDC Case Surveillance Task Force, collates and makes publicly available a dataset of COVID-19 cases in the USA. The CDC’s COVID-19 case surveillance database collates individual-level data reported to US states, autonomous reporting entities, US territories and affiliates. These jurisdictions report the case surveillance data voluntarily to the CDC. The dataset is created using the CDC’s operational Policy on Public Health Research and Non Research Data Management and Access and includes safeguards designed to protect individual privacy. It is openly available through a website as a downloadable dataset, and visualizations can also be accessed. This dataset was created on May 15th 2020, and is updated every two weeks. The first recorded confirmed COVID-19 case in the dataset is from 1st January 2020. A data dictionary and further information can be found here. The form used by the CDC to report cases of COVID-19 can be found here.

02.
Where can I find the original data and how is the data transformed?

Raw data can be downloaded here, and details of the parser that transforms the data to our standard schema can be found here. The CDC USA dataset contains no geographic information below country level. As such, all cases are hard-coded to the same latitude and longitude (the centroid of the USA as obtained from Google Maps). We delete and re-ingest the whole dataset weekly on Sundays, to ensure that all cases are updated and no duplicates exist.
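The delete-and-re-ingest approach can be illustrated with a minimal sketch. This is not the Global.health parser: the in-memory `store`, the `reingest` function and the field names `latitude` and `longitude` are hypothetical, and the centroid is passed in as a parameter because its actual value is not given here.

```python
from typing import Dict, Iterable, List, Tuple


def reingest(
    store: Dict[str, List[dict]],
    source_id: str,
    rows: Iterable[dict],
    centroid: Tuple[float, float],
) -> int:
    """Replace every case held for one source with a freshly parsed batch.

    All names here are illustrative assumptions, not the real schema.
    """
    lat, lon = centroid
    # Every case receives the same coordinates, because the source
    # dataset carries no geography below country level.
    fresh = [{**row, "latitude": lat, "longitude": lon} for row in rows]
    # Overwriting the whole list (rather than appending) is what prevents
    # duplicates when the upstream file is re-published.
    store[source_id] = fresh
    return len(fresh)


# Placeholder coordinates only, not the centroid actually used.
store = {"cdc_usa": [{"age_group": "0 - 9 Years"}]}
count = reingest(store, "cdc_usa", [{"age_group": "10 - 19 Years"}] * 3, (0.0, 0.0))
```

Replacing the batch wholesale trades extra ingestion work for simplicity: because the source has no unique patient ID, there is no reliable key on which to update individual records.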

03.
How complete is the data compared to aggregated data sources?

When compared to the World Health Organization’s (WHO) tally, on August 26th 2021 the dataset included 25,529,092 confirmed COVID-19 cases compared to the WHO’s 37,988,983, approximately 67% of the reported total number of cases. 
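The coverage figure can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the text for August 26th 2021.
line_list_cases = 25_529_092   # confirmed cases in the line list
who_cases = 37_988_983         # cases reported by the WHO

# Share of the WHO total that is covered by the line list.
coverage = line_list_cases / who_cases
print(f"Line list covers {coverage:.0%} of WHO-reported cases")
```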

04.
Key characteristics and limitations of this database:

The CDC line list dataset from the USA provides 12 metadata fields for each patient. No unique ID is provided per patient. The date used as the date of confirmation is ‘cdc_case_earliest_dt’, which is the best available date from the set of dates related to illness/specimen collection (i.e. ‘pos_spec_dt’) or the date received by the CDC (i.e. ‘cdc_report_dt’). This maximises the completeness of the date variable: ‘cdc_case_earliest_dt’ takes the non-null date when only one of the two is available, and the earliest valid date when both are. If no date is available, it is left blank. Although the earliest reported COVID-19 case date for the USA is the 20th of January 2020, the dataset includes 3,175 cases reported between the 1st and the 20th of January 2020.
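The selection logic described for ‘cdc_case_earliest_dt’ can be sketched as follows. This is a reading of the description above, not the CDC's own code: the function name is hypothetical, and the sketch assumes both inputs have already been parsed and validated as dates.

```python
from datetime import date
from typing import Optional


def case_earliest_date(
    pos_spec_dt: Optional[date], cdc_report_dt: Optional[date]
) -> Optional[date]:
    """Return the earliest available date, or None if both are missing."""
    # Keep only the dates that are actually present.
    candidates = [d for d in (pos_spec_dt, cdc_report_dt) if d is not None]
    # One date present: use it. Both present: use the earlier one.
    return min(candidates) if candidates else None


both = case_earliest_date(date(2020, 3, 5), date(2020, 3, 9))
only_report = case_earliest_date(None, date(2020, 3, 9))
neither = case_earliest_date(None, None)
```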

 

The CDC dataset contains cases which are either a ‘Probable Case’ or a ‘Laboratory-confirmed case’; we only consider the laboratory-confirmed cases as confirmed and do not ingest any probable cases. The CDC laboratory criteria for diagnosis, found here, describe a probable case as meeting the clinical criteria for severe respiratory illness and the epidemiological criteria for likely exposure to SARS-CoV-2. A confirmed case is a clinically compatible illness that is laboratory confirmed.
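The filtering step can be sketched as below. The status labels are the ones quoted above; the column name `current_status` and the function name are assumptions made for illustration and should be checked against the data dictionary.

```python
from typing import Iterable, List

CONFIRMED_STATUS = "Laboratory-confirmed case"


def confirmed_only(rows: Iterable[dict]) -> List[dict]:
    """Keep laboratory-confirmed cases and drop probable ones."""
    # `current_status` is an assumed column name for the case status field.
    return [row for row in rows if row.get("current_status") == CONFIRMED_STATUS]


rows = [
    {"current_status": "Laboratory-confirmed case", "age_group": "20 - 29 Years"},
    {"current_status": "Probable Case", "age_group": "30 - 39 Years"},
]
kept = confirmed_only(rows)
```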

 

When a variable is coded to the value ‘Unknown’, jurisdictions have specified in the case data submitted to the CDC that the value is unknown. When it is ‘Missing’, the jurisdiction did not provide a value to the CDC. When the value is ‘NA’, it has been suppressed as part of privacy protections. Data cells are suppressed for low-frequency (<5) records and indirect identifiers. Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). The age group categorizations were populated by the CDC using the age value that was reported on the case report form. Date of birth was used to fill in missing/unknown age values using the difference in time between date of birth and onset date. If more than one race was reported, race was categorized into multiple/other races.
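When exploring a field, it can help to count these three placeholder values separately, since each has a different cause. The sketch below is illustrative only: the function names and the sample values are hypothetical, and it does not reproduce how the completeness percentages at the top of this page were calculated.

```python
from collections import Counter
from typing import Iterable

# Meaning of each placeholder value, as described in the text above.
PLACEHOLDER_MEANING = {
    "Unknown": "reported as unknown by the jurisdiction",
    "Missing": "not provided by the jurisdiction",
    "NA": "suppressed for privacy",
}


def classify(value: str) -> str:
    """Label a raw cell as one of the placeholders or as a reported value."""
    return PLACEHOLDER_MEANING.get(value.strip(), "reported")


def tally(values: Iterable[str]) -> Counter:
    """Count how many cells fall into each category."""
    return Counter(classify(v) for v in values)


# Made-up sample values for a yes/no field.
sample = ["Yes", "Unknown", "Missing", "NA", "No", "Missing"]
counts = tally(sample)
```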


One of the major limitations of this dataset is the lack of geographic information; for this reason every case is geo-located to the centroid of the USA. The CDC does provide a dataset with geographic information to the county level, which also includes information about known exposure, symptom status, and process through which the case was identified (e.g. contact tracing). This dataset, however, only provides a case confirmation date to the closest month.

05.
How to filter, view, and download this data:

To access the most up-to-date data described above, please follow this link. You can also access a visualisation of these data on our Map application.

Signature & Contact

Anya Lindström Battle

Data Scientist and Researcher,
University of Oxford
on behalf of the Global.health team
