On March 20, 2020, just one day after California issued a statewide stay-at-home order to counter the COVID-19 pandemic, Bin Yu answered an email about a new nonprofit organization seeking data science expertise in its efforts to distribute personal protective equipment (PPE) most effectively.
A professor of statistics and electrical engineering and computer sciences at UC Berkeley, Yu quickly assembled a team of 12 students from her group, with computer science Ph.D. student Chandan Singh as her deputy, for what Yu and Singh would call “a warlike two-month engagement.” The team would work with Response4Life, a nonprofit formed just a week earlier with the goal of helping transport PPE to where it was needed most.
In the end, the team’s predictions helped inform the shipment of at least 349,000 face shields to doctors and healthcare workers – 14,000 through Response4Life and the rest through Maker Nexus – at a time when they were desperately needed. The team credited the people at Response4Life and their own combined skills in applied statistics, machine learning, signal processing and coding, and “awesome and timely support from friends, family and colleagues.”
In an invited paper submitted to a special issue of Statistical Science on COVID-19, Yu and Singh recount their rapid-response data science team’s efforts and distill seven principles from the project. Other team members were Nick Altieri (who delayed finishing his thesis), James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Yan Shuo Tan, Tiffany Tang and Yu (Hue) Wang.
“After discussing with the Response4Life team, it became clear that this was a very unusual data-science project, with no data in hand or plans on how to find it,” the authors wrote. “Forecasts were needed to inform which hot-spot hospitals to send PPE, but this would require first finding relevant data, developing short-term (e.g., 5 days ahead) forecasting models, and integrating forecasts” into Response4Life’s Salesforce logistics platform.
A summary of the seven principles identified by the team follows.
1. The decision to engage: preparedness and willingness. Yu’s group of graduate students in statistics and electrical engineering and computer sciences was ready to take on the challenge. The group was already focused on solving real-world problems, had a collaborative, credit-sharing culture and was accustomed to working in smaller teams of two to five people.
2. Effective human organization: divide and conquer. The team had two weeks to come up with daily forecasts, at least five days out, for each of the more than 7,000 hospitals in the United States and put the forecasts on the Salesforce platform for PPE distribution. To accomplish this, the group formed subteams for data gathering, modeling, logistics/visualization for porting the results, and PR to help organize Response4Life volunteers, scour media reports to check their predictions and contact hospitals.
3. Gathering data and context: scraping, human contacts, and media reports. To gather data, the team combined web scraping, human contacts and media monitoring to collect death-count data at the county level. The team chose to predict death counts for each county as the basis for a hospital-level severity index, judging death counts more reliable and relevant than case numbers at that point in the pandemic.
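As an illustration of how a county-level forecast might feed a hospital-level severity index, the hypothetical sketch below splits a county’s predicted deaths across its hospitals in proportion to a size weight. The function name, the proportional-allocation rule and the choice of weight are assumptions for illustration, not the team’s exact method.

```python
def hospital_severity(county_pred_deaths, hospital_weights):
    """Illustrative sketch: convert one county's predicted death count
    into per-hospital severity scores by allocating the county forecast
    to its hospitals in proportion to a size weight (e.g., staff count).

    county_pred_deaths : float, predicted deaths for the county
    hospital_weights   : dict mapping hospital name -> size weight
    """
    total = sum(hospital_weights.values())
    # Each hospital receives a share of the county forecast
    # proportional to its weight.
    return {h: county_pred_deaths * w / total
            for h, w in hospital_weights.items()}
```

Ranking hospitals by such a score across all counties would then suggest where shipments are most urgent.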
4. Data quality control: in-house data cleaning and curation. Because most of the incoming data was messy, quality control was critical. The team set up a pipeline that could quickly adapt to unexpected changes in incoming data sources, ensuring that all data was scraped, cleaned and curated before it reached the models.
5. Speedy development and validation of many prediction algorithms. With a pressing deadline and a rapidly evolving pandemic, the team decided to build on Yu’s audio-compression work at Bell Labs 20 years earlier, since audio signals and pandemic counts share a similarly dynamic nature. The modeling team simultaneously developed five COVID-death predictors using past death and case counts, as well as demographic, social, economic and health factors from the same and neighboring counties.
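The team’s exact predictors are described in their technical paper. As a hedged illustration of the general idea of running several predictors side by side and favoring whichever has tracked recent data best (a scheme with roots in the expert-weighting methods used in data compression), the sketch below combines forecasts with weights that decay exponentially in each predictor’s recent error. All names and the weighting constant are assumptions for illustration.

```python
import numpy as np

def combine_predictions(past_preds, actuals, new_preds, c=1.0):
    """Illustrative sketch: weight each predictor by its recent accuracy.

    past_preds : (n_predictors, n_days) array of each predictor's
                 recent predictions
    actuals    : (n_days,) observed cumulative death counts
    new_preds  : (n_predictors,) each predictor's forecast for a
                 future day
    c          : tuning constant controlling how sharply weights decay
    """
    # Mean absolute error of each predictor over the recent window
    errors = np.abs(past_preds - actuals).mean(axis=1)
    # Exponential weighting: smaller recent error -> larger weight
    weights = np.exp(-c * errors)
    weights /= weights.sum()
    return float(weights @ new_preds)
```

The combined forecast always lies between the individual forecasts, pulled toward the predictor that has recently performed best.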
6. Uncertainty: measurement and empirical validation. Because the situation was changing quickly, the team needed to assess uncertainty with a prediction interval (a range expected to contain the observed value) at a justified level of confidence. The interval was built by adding and subtracting the maximum absolute relative error of recent predictions from the predicted cumulative death count on the fifth (or seventh, or 14th) future day, horizons chosen to leave enough time to distribute PPE from the manufacturers.
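The interval described above can be sketched as follows. This is an illustrative reading of the method, assuming the maximum absolute relative error is taken over a recent window of predictions and applied multiplicatively to the new forecast.

```python
import numpy as np

def prediction_interval(past_preds, past_actuals, new_pred):
    """Illustrative sketch of an interval from maximum absolute
    relative error.

    past_preds   : recent k-day-ahead predictions
    past_actuals : the cumulative death counts actually observed
    new_pred     : today's k-day-ahead prediction
    """
    past_preds = np.asarray(past_preds, dtype=float)
    past_actuals = np.asarray(past_actuals, dtype=float)
    # Worst-case relative miss over the recent window
    max_err = np.max(np.abs(past_preds - past_actuals) / past_actuals)
    # Expand the new prediction by that fraction in both directions
    lower = new_pred * (1.0 - max_err)
    upper = new_pred * (1.0 + max_err)
    return lower, upper
```

Validating such intervals empirically means checking how often the observed count actually fell inside them over held-out days.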
7. Communicating results: interactive visualizations, open-source code, and a web interface. Communicating results quickly and effectively was essential in the rapid-response setting. The team posted interactive visualizations on their website and released all of their code and curated data as open source on GitHub, making it easy for other groups to build on the work. All data, both curated and raw, was put in a common table format that could be easily distributed.
“Although these principles are described in the context of working with Response4Life, many of the principles overlap with those in standard data science,” Yu said. “But in this project, we emphasized the need for a rapid response due to the fast spread of lethal COVID-19 hot spots across the country. There were no other county-level prediction models available in the U.S. until after our paper was submitted on May 16, 2020.”
The technical details of the team’s work are described in a research paper in Harvard Data Science Review. Yu discusses the process of coming up with their prediction model in the context of their engagement with Response4Life in an interview with Xiao-Li Meng as part of Harvard Data Science Review’s Conversation with Authors series.