Data science vs the COVID-19 pandemic: Flattening the curve -- but how?

Whether they are epidemiologists or not, a few people have attempted to use data and predictive models to understand the COVID-19 pandemic. Let's look at the models, the data, and the assumptions and implications that come with them.
Written by George Anadiotis, Contributing Writer on

While things change from day to day, by now we may have enough data, models, and opinions to make some data-driven observations on how the COVID-19 pandemic is spreading. Perhaps more importantly, we can venture a view on what it will take to stop it.

The COVID-19 virus was first detected in late 2019 in China. Since then, it seems to have stopped spreading in China, while unfortunately it is at different stages of development around the rest of the world. It is questionable whether the data available at this point is enough to draw conclusions.

Machine learning prediction experts from Carnegie Mellon University working on COVID-19 forecasts, for example, acknowledge there is far more uncertainty than usual. Still, they believe their work will be worthwhile in informing the CDC and improving the agency's preparation. Let's look at how different people use data for their analyses around the world and try and draw from their insights.

Analysis #1: It's exponential, or why you must act now

40 million views and 28 translations in a week is a lot, even for an article on a matter of life and death. The author of this Medium post, Tomas Pueyo, is not an epidemiologist. This, however, does not necessarily mean his analysis of epidemiology data is flawed. If nothing else, it is pretty dense, looks convincing, and has been lauded by some health experts and scientists.

Pueyo's analysis exemplifies what has come to be known as the "flattening the curve" approach. The bottom line of the analysis is that COVID-19 is a pandemic now, so it can't be eliminated. What can be done is to reduce its impact. According to the analysis, almost everyone will get infected, so the goal should be to have as few people infected at the same time as possible.

The analysis draws from data in places like Taiwan and Korea, where this approach was adopted early on, and adhered to. The way to do this is by social distancing, and the faster this happens, the more effective it will be, according to this analysis.

The key idea behind the notion of flattening the COVID-19 spread curve is to make sure not everyone gets infected at the same time. This way the healthcare infrastructure can have better chances of coping.
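To make the idea concrete, here is a minimal sketch using a textbook SIR (susceptible-infected-recovered) model. It is not the model used in Pueyo's analysis, and every number in it (the transmission rate `beta`, the recovery rate `gamma`, the population size) is an illustrative assumption rather than a fitted value. It only shows the qualitative effect: halving the transmission rate produces a lower and later peak of simultaneous infections.

```python
# Minimal SIR sketch. All parameter values are illustrative assumptions,
# not estimates for COVID-19.

def simulate_sir(beta, gamma=0.1, days=600, population=1_000_000,
                 initial_infected=10, steps_per_day=10):
    """Return the number of currently infected people at the start of each day."""
    s = float(population - initial_infected)  # susceptible
    i = float(initial_infected)               # currently infected
    dt = 1.0 / steps_per_day
    infected_per_day = []
    for _ in range(days):
        infected_per_day.append(i)
        for _ in range(steps_per_day):        # simple Euler integration
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
    return infected_per_day


def peak(series):
    """Return (day of the peak, number infected at the peak)."""
    day = max(range(len(series)), key=series.__getitem__)
    return day, series[day]


baseline_day, baseline_peak = peak(simulate_sir(beta=0.30))    # no distancing
distanced_day, distanced_peak = peak(simulate_sir(beta=0.15))  # transmission halved

print(f"no distancing: peak of {baseline_peak:,.0f} infected on day {baseline_day}")
print(f"distancing:    peak of {distanced_peak:,.0f} infected on day {distanced_day}")
```

In this toy setup the peak with distancing is well under half the baseline peak and arrives later, which is the "flattening" the curve refers to. Real-world models add many factors this sketch ignores, such as incubation periods, age structure, and changing behavior.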

What underpins the analysis is data from a number of scientific publications or pre-prints. The distinction is important here. The scientific publication process is known to suffer from a number of issues, with what we would call time to market being prominent among them.

The peer review and publication process can take anywhere from a few months to a few years to complete. This means that in cases like this, where data availability is important, the process is often sidestepped in favor of either non-vetted, but readily available data, or scientific paper pre-prints.

Data can be accessed via dashboards and data hubs created by various organizations, ranging from governments to private enterprises and volunteers. Scientific paper pre-prints can be found at hubs like arXiv or Zenodo, enabling researchers to share their findings immediately.

Those sources differ from the official ones in some important ways. The data and findings shared through those do not come with official endorsement, unless otherwise stated, and have not gone through a peer review process. This does not necessarily make them untrustworthy, but it does mean they should be critically evaluated.

Analysis #2: It's not exponential, or herd immunity

A key assumption underlying the "act now" analysis is that the COVID-19 infection rate follows what is called an exponential curve. We have seen this assumption being challenged, however. Let's be clear -- nowhere have we seen any serious analyses challenging the fact that social distancing is a necessary measure at times like these. This is about something more nuanced.

What people like Richard Baldwin and Thomas House are pointing out is that, technically speaking, the COVID-19 infection rate curve is not exponential. Rather, they point out, it follows an epidemiological curve. While exponential curves keep rising, epidemiological curves rise to a peak, then fall, and may then have a second peak.
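One way to see how both descriptions can hold is the textbook SIR model, written out below. This is a standard simplification and not necessarily the model Baldwin or House have in mind; the symbols (S, I, R for susceptible, infected, and recovered people, N for the population, beta for the transmission rate, gamma for the recovery rate) are assumptions introduced here for illustration.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \qquad
\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I \\[4pt]
\text{Early on, } S \approx N:\quad
\frac{dI}{dt} &\approx (\beta - \gamma)\, I
\;\Longrightarrow\;
I(t) \approx I_0\, e^{(\beta - \gamma) t} \\[4pt]
\text{At the peak, } \frac{dI}{dt} = 0:\quad
S &= \frac{\gamma}{\beta}\, N
\end{aligned}
```

Under these assumptions, growth looks exponential while nearly everyone is still susceptible, but it cannot stay that way: once the susceptible share drops to gamma/beta, infections peak and start to fall. If beta rises again before that point, for example because measures are lifted, infections can grow again, which is one way a second peak can arise.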

This has to do with whether social distancing and other related measures continue to be enforced. If not, a second wave of infection may occur. At this point, however, it seems that the analyses part ways and reach different conclusions.


Simulation of COVID-19's new case evolution in 2020. Source: Anderson et al. (2020)

Baldwin, a Professor of International Economics at The Graduate Institute in Geneva, notes that curve-flattening policies have immediate economic consequences. He consequently embarks on analyses of how governments could respond, as well as rising inequality in the US, to conclude the pandemic may result in social upheaval.

House, a reader in mathematics at Manchester University, used his model to test the impact of social distancing measures lasting three weeks, as reported by Sky News. His findings suggest that if measures were started 40 days into the outbreak, the total number of cases in the subsequent few weeks would be significantly lower than if they were started later.

But the model suggests cases would rise rapidly once the measures were relaxed, in effect simply delaying the peak in cases. By contrast, bringing in the same measures later in the outbreak resulted in a second wave of cases, but the peak for each was lower. It cut in half the maximum number of people who were sick at any moment in time and dramatically cut the total infected, which early interventions failed to do.
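The "delaying the peak" effect of an early, temporary intervention can be reproduced in a toy model. The sketch below is not House's model: it is a basic SIR simulation in which the transmission rate is lowered for a three-week window and then restored. The function name `simulate_with_pause`, the start day, and all rates are hypothetical choices for illustration, and the sketch makes no attempt to reproduce the findings about later interventions.

```python
# Toy SIR model with a temporary reduction in transmission.
# Not House's model; all parameter values are illustrative assumptions.

def simulate_with_pause(start_day=None, pause_days=21, beta=0.30,
                        paused_beta=0.10, gamma=0.1, days=300,
                        population=1_000_000, initial_infected=10,
                        steps_per_day=10):
    """Return currently infected people per day; measures apply during the pause."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    dt = 1.0 / steps_per_day
    infected_per_day = []
    for day in range(days):
        infected_per_day.append(i)
        paused = start_day is not None and start_day <= day < start_day + pause_days
        current_beta = paused_beta if paused else beta
        for _ in range(steps_per_day):  # simple Euler integration
            new_infections = current_beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
    return infected_per_day


def peak_day(series):
    """Return the day on which the series reaches its maximum."""
    return max(range(len(series)), key=series.__getitem__)


no_measures = simulate_with_pause(start_day=None)
early_measures = simulate_with_pause(start_day=20)  # three weeks, starting on day 20

print("peak day without measures:  ", peak_day(no_measures))
print("peak day with early measures:", peak_day(early_measures))
print("peak height ratio:", round(max(early_measures) / max(no_measures), 3))
```

In this sketch the early three-week pause shifts the peak by roughly the length of the pause while leaving its height almost unchanged, because nearly the whole population is still susceptible when the measures end.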

House concludes that delaying action can allow immunity in the population to build up, reducing the number of people vulnerable to infection. Sky News notes that similar modelling is likely to underpin the UK government strategy, which may explain why it has simply urged those with symptoms to stay at home, while other countries have been more aggressive in their approach, closing bars or banning public gatherings.

It has been noted, however, that the phrase "herd immunity" has been misinterpreted. Graham Medley from the London School of Hygiene and Tropical Medicine chairs a group of scientists who model the spread of infectious diseases and advise the government on pandemic responses. He says that the actual goal is the same as that of other countries: flatten the curve by staggering the onset of infections. As a consequence, the nation may achieve herd immunity; it's a side effect, not an aim.

Smarter COVID-19 decision making and resources

If there is one lesson to draw from reading those analyses side by side it would be that it's complicated. Decision making, even aided by data science and analytics, is not straightforward, especially in a field as foreign to most of us as epidemiology. Some rules for data-driven decision making still apply, however.

Cassie Kozyrkov is Head of Decision Intelligence at Google. She recently wrote a Medium article on smarter COVID-19 decision making. Kozyrkov does not pretend to be an epidemiologist, and does not call to action. Instead, her aim is to share with the world a sound, generic decision-making process, driven by specific steps, criteria, and data.


Epidemiology is a domain most of us know very little about. But data-driven decision making principles can still help cut through the noise at times like these

Getty Images/iStockphoto

Kozyrkov's methodology is composed of six steps:

  • Facing your irrationality
  • Understanding yourself and setting objectives
  • Considering potential actions
  • Choosing action triggers
  • Choosing minimum quality of sources
  • Gathering information, and acting -- or not

In a way, this is DataOps for the people. Every step of this methodology corresponds to a step in data-driven decision making applied at the organization level, from changing organizational culture, to setting objectives, to practicing data governance and data source evaluation. Being aware of the process and acting accordingly can be beneficial at the individual level, too.

Pueyo also followed this recipe, in a way: "What I did was aggregate the opinions of experts," he said. "Everything I have is from the raw data or analysis from other people."

Wrapping up, here are some resources you may find useful if you want to be up to date with the latest data, or get more involved in using technology to help with the COVID-19 pandemic.

