February 10, 2021

Over the past five years, the Data Science Discovery program has helped hundreds of Berkeley undergraduate students join data science research projects. Students spend a semester on a team-based research project with one of Discovery’s project partners, who come from a variety of fields and include Berkeley faculty, campus-affiliated start-ups, and non-profit agencies. The projects tackle real-world issues using data science techniques, and students earn course credit toward their degree for participating. Notable past Discovery partners include the city of Dallas, the National Aeronautics and Space Administration (NASA), and the United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA).

“We want to help students find their place within the broad world of data science, and prepare them with real-world experience that is going to pay dividends after graduation,” says Arlo Malmberg, the program manager of Discovery. “This semester alone we're working with more than 150 students on 43 projects as diverse as visualizing data on natural disasters with the SF Chronicle, using machine learning to predict train delays with BART, and using natural language processing to recommend personalized treatment plans with UCSF.” 

One such project builds language models and natural language processing (NLP) tools for Ancient World Computational Analysis (AWCA), with the goal of creating a citation network from any collection of PDFs. This semester, the team is fine-tuning previous work and building workflows in Jupyter Notebooks. The project partner, Dr. Adam Anderson, finds that the Discovery program is an excellent opportunity for “multi-level involvement” (from students to faculty) and enjoys the “less formal, more hands-on” aspect of the meetings, which allows team members to learn freely from one another.

Another Discovery project aims to improve the San Francisco Chronicle’s California Fire Tracker, a real-time interactive map that displays the spread of wildfires and smoke across the state. “We're looking for more live-updating data layers to show, and it's up to us to brainstorm what layers viewers might want to see, and find datasets for those online,” said Owen Zhang, one of the undergraduates on the project this semester. He is currently working on converting PM2.5 (fine particulate matter) concentrations to AQI (Air Quality Index) values and adding a fire-fuel layer to the map so that viewers can easily see areas with abundant dry vegetation (i.e., areas that would feed a wildfire).
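For readers curious about the PM2.5-to-AQI conversion Zhang mentions, here is a minimal Python sketch. It assumes the U.S. EPA's piecewise-linear AQI formula and the 24-hour PM2.5 breakpoints that were in effect in 2021 (the EPA revised them in 2024). The function name `pm25_to_aqi` and the handling of off-scale readings are illustrative assumptions, not the Fire Tracker team's actual code.

```python
from decimal import Decimal, ROUND_DOWN

# (C_low, C_high, I_low, I_high) for 24-hour PM2.5 in micrograms per cubic meter.
# Assumption: U.S. EPA breakpoints in effect in 2021 (before the 2024 revision).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration to an AQI value (hypothetical helper)."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")

    # Truncate to one decimal place before the breakpoint lookup.
    # Decimal avoids binary floating-point surprises when truncating.
    c = float(
        Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN)
    )

    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation within the matching breakpoint band.
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)

    # Simplifying assumption: readings above the top band are capped at 500.
    return 500


if __name__ == "__main__":
    for reading in (8.0, 35.9, 160.0):
        print(reading, "->", pm25_to_aqi(reading))
```

A real map layer would also need to decide which averaging window the sensor data represents, since the breakpoints above are defined for 24-hour averages rather than instantaneous readings.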

As someone who enjoys looking at maps, Zhang likes learning how digital maps are made. The project exposes him to a large amount of Geographic Information Systems (GIS) data, and though it can be overwhelming, he finds it very interesting. “The world of real data science work is very different from what I see with my background as a software engineer/computer scientist,” Zhang said. “Often people think of the two as highly related, and they are, but the challenges presented by these two lines of work can be very different.”

One of the things students can expect to gain from the Discovery program is exactly that: witnessing first-hand the “messy data, project management challenges, and unanticipated discoveries that come with data research,” said Malmberg. These hands-on experiences help many Discovery students land internships and full-time positions. Such was the case for Anna Burns, who was a student leader on the Discovery project with Work at Home Vintage Experts (WAHVE) and later joined the company as a full-time junior data scientist.

By the end of this academic year, the Discovery program will have connected around 1,000 students to data science projects beyond the classroom. Discovery expects its reach to keep growing as it works to improve the experience for both students and research project partners: the program plans to release DiscoveryHub (a JupyterHub) soon, giving all teams one-click access to Jupyter notebooks with graphics processing unit (GPU) support, and it is developing a new paid internship program for undergraduate students.

