Data Deep Dive

Germany Covid-19 Line List

Data up to August 13th, 2021
Completeness *$
Meldedatum (Date of Confirmation): 100%
Altersgruppe (Age group): 100%
Geschlecht (Gender): 100%
Bundesland (Administrative level 1 geographic data): 100%
IdLandkreis (Administrative level 3 geographic data): 100%
Refdatum (Date of symptom onset): 100%
AnzahlTodesfall (Number of cases where outcome was death): 100%
[Image: map of Europe with Germany highlighted]

* As of 13/08/2021.
$ The line list includes 99.997% of the cases reported for Germany by the World Health Organization.

In this case study we take a deep dive into COVID-19 line list data from Germany, one of more than 130 countries included in the Global.health platform. The case study covers the provenance of the data, the transformations applied to fit the Global.health schema, and the key characteristics and limitations of the data. Although it addresses only one country, the design of Global.health lets users quickly ask similar questions about any country on the platform, and we discuss how users can conduct such an investigation on their own. Stay tuned for more of these data deep dives. Other examples are available for Argentina and Brazil.

 

Germany is one of only a few countries in Europe that consistently publish individual-level line list data (the Czech Republic, Scotland, Zurich/Switzerland, Israel and Estonia are among the others; see a regional map of cases here). The UK released one dataset during the emergence of the Alpha variant.

01.
What is the provenance of the data?

The Robert Koch Institute (RKI) collects and shares individual-level case data for COVID-19 through a website and a data visualization dashboard. The data are transmitted to the RKI by the local German health authorities (Gesundheitsämter), processed at the RKI once a day at midnight, and published in the early hours of the morning. The dataset was first published on the 18th of March 2020, and the earliest recorded case in the dataset is from the 2nd of January 2020. A data dictionary and further information can be found here.

02.
Where can I find the original data and how is the data transformed?

Raw data can be downloaded here, and details of the parser that transforms the data to our standard schema can be found here. We geocode cases within Germany down to administrative level 3, using data from the “Nationale Plattform für geographische Daten” to map each Landkreis to longitude and latitude. Because the source provides no unique case IDs, individual cases cannot be matched and updated in place. Instead, we delete and fully re-ingest the dataset every Sunday so that all cases are up to date; note that this means the data may be up to one week out of date.
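The delete-and-re-ingest approach can be illustrated with a small in-memory sketch. This is not the Global.health ingestion code: the function name `reingest`, the `source_id` key and the list-of-dicts store are all hypothetical, chosen only to show why a full replacement is the simplest option when rows carry no unique ID.

```python
from typing import Dict, List


def reingest(store: List[Dict], source_id: str, fresh_cases: List[Dict]) -> List[Dict]:
    """Return a new store in which all cases from `source_id` are replaced.

    Without unique case IDs there is no reliable way to match an old row to
    its updated version, so every stored case from the source is dropped and
    the fresh download is inserted in full.
    """
    # Keep only cases that belong to other sources.
    kept = [case for case in store if case.get("source_id") != source_id]
    # Tag each freshly downloaded case with its source (hypothetical field).
    tagged = [{**case, "source_id": source_id} for case in fresh_cases]
    return kept + tagged
```

Calling `reingest(store, "de", new_rows)` leaves cases from other sources untouched and returns a new list, so the previous store remains available until the swap is complete.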

03.
How complete is the data compared to aggregated data sources?

On August 13th, 2021 the RKI dataset included 3,810,514 cumulative cases. This corresponds to 99.997% of those reported by the World Health Organization (WHO) on that day (3,810,641).
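The percentage follows directly from the two counts quoted above. A minimal check in Python (the variable names are ours):

```python
# Cumulative case counts quoted in the text for 13 August 2021.
line_list_cases = 3_810_514  # RKI line list
who_cases = 3_810_641        # WHO aggregate for Germany

# Share of WHO-reported cases that appear in the line list.
coverage_pct = 100 * line_list_cases / who_cases
missing_cases = who_cases - line_list_cases

print(f"{coverage_pct:.3f}% coverage, {missing_cases} cases missing")
# prints "99.997% coverage, 127 cases missing"
```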

04.
Key characteristics and limitations of this database:

The RKI line list dataset from Germany provides 17 metadata fields for each patient. No unique ID is provided per patient. The reporting date (‘Meldedatum’) is used as the date of confirmation in our database; it corresponds to the date on which the local health authority (Gesundheitsamt) became aware of the case. The RKI only publishes cases that are confirmed by a laboratory test in addition to showing clinical symptoms. The Global.health ingestion process automatically ignores cases dated earlier than the earliest allowed date of 1st November 2019. According to the WHO, the first COVID-19 case in Germany was detected on January 28th 2020; the dataset nevertheless contains one case with an earlier confirmation date (2nd January 2020), which still falls after the earliest allowed date.
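The date cut-off described above amounts to a single comparison. The sketch below assumes the confirmation date has already been parsed into a `datetime.date`; the name `is_ingestible` is hypothetical and is not taken from the Global.health parser.

```python
from datetime import date

# Earliest confirmation date accepted at ingestion, as stated in the text.
EARLIEST_ALLOWED = date(2019, 11, 1)


def is_ingestible(meldedatum: date) -> bool:
    """Return True if the confirmation date is on or after the cut-off."""
    return meldedatum >= EARLIEST_ALLOWED
```

Under this rule the case confirmed on 2nd January 2020 passes the check (`is_ingestible(date(2020, 1, 2))` is `True`), while any case dated in October 2019 or earlier would be ignored.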

 

The dataset contains geographic information at administrative levels 1 (‘Bundesland’) and 3 (‘IdLandkreis’, ‘Landkreis’). Although level 3 information is reported both as a name (‘Landkreis’) and as a code (‘IdLandkreis’), the code is used for geocoding to avoid problems caused by spelling variations. Some codes have leading zeros, which must be removed in order to match the data obtained from the Nationale Plattform für geographische Daten. Where no match is found, the case is geocoded to administrative level 1 using Mapbox. Note that this applies to all subdivisions within Berlin, as these are officially administrative level 2 areas and are therefore not present in the dataset obtained from the Nationale Plattform für geographische Daten. Berlin cases are therefore all geocoded to level 1, namely ‘Berlin’. The RKI assigns each case a location according to the federal state or district where the case was reported. This usually corresponds to the patient's place of residence or habitual residence, and not necessarily to the place where the person is likely to have been infected.
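The matching logic described above (strip leading zeros, look up the level 3 code, fall back to level 1) can be sketched as follows. This is an illustration under stated assumptions, not the actual parser: the function names are hypothetical, the lookup entry is a made-up placeholder rather than real reference data, and the Mapbox call used for the fallback is deliberately left out.

```python
from typing import Dict, Tuple

# Hypothetical excerpt of the level 3 reference data:
# code without leading zeros -> (name, latitude, longitude).
# The entry below is a placeholder, not real data.
EXAMPLE_LOOKUP: Dict[str, Tuple[str, float, float]] = {
    "1001": ("Example district", 54.0, 9.0),
}


def normalise_code(id_landkreis: str) -> str:
    """Strip leading zeros so the code matches the lookup keys."""
    return id_landkreis.lstrip("0")


def geocode(id_landkreis: str, bundesland: str,
            lookup: Dict[str, Tuple[str, float, float]]) -> dict:
    """Resolve a case to level 3 if the code is known, else to level 1."""
    match = lookup.get(normalise_code(id_landkreis))
    if match is not None:
        name, latitude, longitude = match
        return {"resolution": "admin3", "name": name,
                "latitude": latitude, "longitude": longitude}
    # No level 3 match: fall back to the Bundesland. In the pipeline described
    # above this step is geocoded with Mapbox, which is not called here.
    return {"resolution": "admin1", "name": bundesland}
```

With this sketch, a code such as "01001" resolves through the lookup once its leading zero is removed, while a code that is absent from the lookup, as described above for the Berlin subdivisions, resolves only to its Bundesland.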

 

The RKI dataset groups cases with identical metadata entries together and reports the total count of cases sharing that set of attributes (‘AnzahlFall’). Our parser uses this number to enter the corresponding number of identical cases into the database. The outcome of a case can be either death (‘AnzahlTodesfall’) or recovery (‘AnzahlGenesen’). Because of case grouping, the number of recovered or dead patients will always be identical to the total number of cases in that group (i.e. ‘AnzahlFall’ will always be identical to either ‘AnzahlTodesfall’ or ‘AnzahlGenesen’).
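Expanding a grouped row into individual cases can be sketched as below. The field names come from the RKI dataset as described above; the function name `expand_group`, the `outcome` key and the handling of rows where neither outcome count matches are assumptions made for illustration, not the behaviour of the Global.health parser.

```python
from typing import Dict, Iterator

COUNT_FIELDS = {"AnzahlFall", "AnzahlTodesfall", "AnzahlGenesen"}


def expand_group(row: Dict[str, object]) -> Iterator[Dict[str, object]]:
    """Yield one case per unit of 'AnzahlFall', copying the shared attributes."""
    count = int(row["AnzahlFall"])
    # Per the description above, the whole group shares a single outcome.
    if count == int(row["AnzahlTodesfall"]):
        outcome = "death"
    elif count == int(row["AnzahlGenesen"]):
        outcome = "recovery"
    else:
        outcome = None  # assumption: outcome left unset if neither count matches
    attributes = {key: value for key, value in row.items() if key not in COUNT_FIELDS}
    # Assumption: a non-positive count produces no cases.
    for _ in range(max(count, 0)):
        yield {**attributes, "outcome": outcome}
```

For example, a row with illustrative values `{"Bundesland": "Bayern", "AnzahlFall": 3, "AnzahlTodesfall": 0, "AnzahlGenesen": 3}` yields three identical cases, each with the outcome "recovery".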



05.
How to filter, view, and download this data:

To access the most up-to-date version of the data described above, please follow this link; you can also see a visualisation of this data.

Signature & Contact

Anya Lindström Battle


Data Scientist,
University of Oxford
on behalf of the Global.health team
