Data Deep Dive

Colombia Covid-19 Line List

Data up to June 6, 2021

Completeness*$
Case ID: 100%
Fecha de diagnóstico (Date of diagnosis)
Fecha de inicio de síntomas (Date of symptom onset)
Municipality & Department
Edad (age) and Sexo (sex): 100%
Recuperado (Outcome): 99.68%
Pertenencia étnica

*as of 06/06/2021; $ line list includes 100% of cases reported for Colombia by the World Health Organization

In this case study we take a deep dive into COVID-19 line-list data from Colombia, one of the more than 130 countries included in the platform. The case study covers the provenance of the data, the transformations applied to fit the data to the schema, and the key characteristics and limitations of the data. Although it addresses only one country, the design of the platform lets users quickly ask similar questions about any country included in it, and we discuss how users can conduct such an investigation on their own. Stay tuned for more of these data deep dives. Here is another example for Peru.

What is the provenance of the data?

The National Institute of Health of Colombia (Instituto Nacional de Salud) collects and shares individual-level case data for COVID-19 through an interactive website hosted by the Ministry of Information Technologies and Communications (MinTIC). The platform was launched on March 27th, 2020, and the data are updated daily. The first recorded case in the dataset is from March 6th, 2020. The metadata fields (n = 23) were updated on October 29th, 2020, and have remained consistent since then. A data dictionary and further information can be found here.

Where can I find the original data and how is the data transformed?

Raw data can be downloaded from the link here, and details of the parser that transforms the data to our standard schema can be found here. We geocode cases by adding centroids (latitude and longitude) from a manual lookup table provided by ESRI Colombia. We ingest this database once per day and check for updates to previously ingested cases going back one month.
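
To make the geocoding and look-back steps more concrete, here is a minimal sketch in Python. It is not the parser linked above: the table contents, the function names (`geocode`, `needs_recheck`) and the approximation of "one month" as 30 days are all assumptions made for illustration, and the coordinates are rough placeholder values rather than entries from the ESRI Colombia table.

```python
from datetime import date, timedelta
from typing import Dict, Optional, Tuple

# Hypothetical lookup table keyed by (department, municipality).
# Coordinates are rough illustrative values, not entries from the ESRI Colombia table.
CENTROIDS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("BOGOTA", "BOGOTA"): (4.65, -74.08),
    ("ANTIOQUIA", "MEDELLIN"): (6.25, -75.57),
}

# Assumption: "one month" is approximated as 30 days.
LOOKBACK = timedelta(days=30)


def geocode(department: str, municipality: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for a case, or None if the place is unknown."""
    key = (department.strip().upper(), municipality.strip().upper())
    return CENTROIDS.get(key)


def needs_recheck(first_ingested: date, today: date) -> bool:
    """True if a previously ingested case is still inside the look-back window."""
    return today - first_ingested <= LOOKBACK
```

Returning `None` for an unknown place, rather than raising an error, keeps a daily ingest running when a new municipality name appears; such cases can then be reviewed and added to the table by hand.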

How complete is the data compared to aggregated data sources?

The individual-level case data provided by the Colombian Ministry of Health appear to be complete when compared with the aggregated data provided to the World Health Organization (WHO): for example, on June 6th, 2021, the dataset included 3,547,017 records, which is 100% of the cases reported by WHO on that day (3,547,017).

Key characteristics and limitations of this database:

The line list dataset from Colombia provides 23 metadata fields for each patient. These include a unique ID (‘ID de caso’), which helps track each case through time. Geographic metadata are provided and include the Department and Municipality where the case was reported. Date of Diagnosis (‘Fecha de diagnóstico’) is provided in 99.88% of cases, and we use it as the Date of Confirmation. Where it is absent, we use the Date Reported Online (‘fecha reporte web’) instead. We have no information on the type of test used to diagnose COVID-19, but the method used to confirm recovery (‘Tipo de recuperación’) is provided in 92.77% of cases. This can be either ‘time’, i.e., 28 days have elapsed with no further symptoms, or a negative PCR test.
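
The fallback rule for the Date of Confirmation can be sketched as follows. The two field names come from the dataset as described above; the function name `confirmation_date` and the assumption that dates arrive as ISO-formatted strings (`YYYY-MM-DD`, optionally followed by a time) are ours, so the parsing would need adjusting if the raw format differs.

```python
from datetime import date
from typing import Mapping, Optional

# Preferred field first, fallback second.
DATE_FIELDS = ("Fecha de diagnóstico", "fecha reporte web")


def confirmation_date(record: Mapping[str, Optional[str]]) -> Optional[date]:
    """Use the diagnosis date when present, otherwise the date reported online."""
    for field in DATE_FIELDS:
        raw = (record.get(field) or "").strip()
        if raw:
            # Assumption: values start with an ISO date (YYYY-MM-DD).
            return date.fromisoformat(raw[:10])
    return None
```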


A strength of this dataset is the number of events with associated dates recorded: we are provided with date of symptom onset, date of diagnosis and ultimate outcome in over 88% of cases. We are also provided with information on severity of symptoms: ‘Estado’ is completed for 99.62% of cases as one of light, moderate, serious or death; ‘Ubicacion del caso’ (case location) also records whether each case was last recorded at home (‘Casa’), in hospital (‘Hospital’) or in intensive care (‘Hospital UCI’). Along with age and sex being documented for 100% of cases, the ethnicity (‘Pertenencia étnica’) of 98.33% of cases is recorded. The particular ethnic or indigenous group (‘Nombre del grupo étnico’) within Colombia is documented for a small percentage of cases (0.98%).
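
Completeness figures like those quoted above can be reproduced for any field with a few lines of Python. This is a sketch under stated assumptions: records are represented as dictionaries keyed by the original field names, a value counts as missing when it is absent, `None`, or blank, and the function name `completeness` is hypothetical.

```python
from typing import Iterable, Mapping, Optional


def completeness(records: Iterable[Mapping[str, Optional[str]]], field: str) -> float:
    """Percentage of records with a non-empty value for `field`, to 2 decimals."""
    rows = list(records)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if (row.get(field) or "").strip())
    return round(100 * filled / len(rows), 2)
```

For example, on a made-up sample of four records in which three have a value for ‘Pertenencia étnica’, `completeness(sample, "Pertenencia étnica")` returns `75.0`.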


The ‘Tipo de contagio’ field is documented for all cases and separates them according to whether they were imported (‘Importado’), transmitted within the country (‘Relacionado’ and ‘Comunitaria’), or of unknown origin (‘En estudio’). For the 3,020 cases listed as imported, the country of origin is documented.
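
As an illustration of how the ‘Tipo de contagio’ values group together, the sketch below maps each value to one of three categories and tallies them. The four Spanish values are those listed above; the category labels (`imported`, `local`, `unknown`), the function names, and the choice to treat unrecognised values as `unknown` are assumptions for this example.

```python
from collections import Counter
from typing import Iterable

# Grouping of the documented 'Tipo de contagio' values (labels are illustrative).
ORIGIN_BY_TIPO = {
    "Importado": "imported",
    "Relacionado": "local",
    "Comunitaria": "local",
    "En estudio": "unknown",
}


def classify_origin(tipo_de_contagio: str) -> str:
    """Map a raw 'Tipo de contagio' value to a category; unrecognised values become 'unknown'."""
    return ORIGIN_BY_TIPO.get(tipo_de_contagio.strip(), "unknown")


def count_origins(values: Iterable[str]) -> Counter:
    """Tally cases per origin category."""
    return Counter(classify_origin(value) for value in values)
```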

How to filter, view, and download this data:

To access the most up-to-date version of the data described above, please follow this link and see a visualisation of this data.

Signature and Contact

Felix Jackson

Researcher and DPhil Student,
Computer Science Department,
University of Oxford
on behalf of the team.
