Global.health is a first-of-its-kind, easy-to-use global repository and visualization platform that provides open access to real-time, de-identified epidemiological line-list case data. It was developed during the COVID-19 pandemic. The project began with a group of volunteers recording data in spreadsheets, an approach that quickly ran into scaling issues. This led to the development of a platform built on MongoDB, Node.js, and Python that now contains more than 100 million records in a uniform schema, imported daily from official and authoritative sources.
This talk describes the evolution of our data infrastructure, focusing on (i) the challenges we faced in developing and extending the data schema, (ii) how we tackled inefficiencies in searching and filtering large datasets, and (iii) how we addressed privacy concerns when importing datasets with potentially identifiable information.