Fall 2019 Discovery Projects

Work at Home Vintage Experts


WAHVE pairs companies looking for specific skills with the veteran talent who have them. Businesses get the quality and knowledge they need, while "vintage" professionals (people over 50) get to phase into retirement working from their home offices. We have analyzed over 52,692 unique applicants and are looking for students to participate in NLP analysis of the data.

Cal Alumni Association Data Science Project

Cal Alumni Association

The Cal Alumni Association Data Science Project is a collaboration between the UC Berkeley Division of Data Sciences and Cal Alumni Association to assess the correlations between participation in alumni engagement activities and CAA mission fulfillment.

Neural Networks for Irregular Time Series: Online Machine Learning for Streaming Prediction

Lawrence Berkeley National Laboratory

As part of the effort to evaluate the potential of neural networks for scientific applications, we are exploring the effectiveness of neural networks for predicting strongly irregular time series. The goal is to understand the limitations of current neural network designs and the best ways to train neural networks on streaming data. Initially, this exploration will be performed with CPU/GPU software. Eventually, we anticipate porting the best neural networks onto an ML hardware system.
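
As a rough illustration of the streaming setting, the sketch below (entirely hypothetical, not the lab's actual model) trains a linear autoregressive predictor one observation at a time with per-sample gradient updates; a neural network would replace the linear model but follow the same predict-then-update loop.

```python
# Online learning sketch: predict the next value of a stream, then take one
# SGD step once the true value is observed. All parameters are illustrative.

class OnlineARPredictor:
    """Linear AR(p) model updated by per-sample stochastic gradient descent."""

    def __init__(self, order=3, lr=0.01):
        self.w = [0.0] * order   # AR coefficients
        self.b = 0.0             # bias
        self.lr = lr

    def predict(self, window):
        return sum(wi * xi for wi, xi in zip(self.w, window)) + self.b

    def update(self, window, target):
        """One SGD step on squared error; returns the pre-update prediction."""
        pred = self.predict(window)
        err = pred - target
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, window)]
        self.b -= self.lr * err
        return pred

# Streaming loop over a toy oscillating series: predict, observe, learn.
series = [0.0, 1.0] * 100
model = OnlineARPredictor(order=3, lr=0.05)
errors = []
for t in range(3, len(series)):
    window = series[t - 3:t]
    pred = model.update(window, series[t])
    errors.append((pred - series[t]) ** 2)

early = sum(errors[:20]) / 20
late = sum(errors[-20:]) / 20   # prediction error shrinks as the model adapts
```

The same loop structure carries over to a neural model; the open questions named above (design limits, training strategy for streams) concern what replaces the `update` step.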

Machine learning based event classification for Higgs Boson property measurements

Lawrence Berkeley National Laboratory

An unprecedented amount of data is collected by the ATLAS experiment at the Large Hadron Collider, enabling searches for new physics beyond the established Standard Model of particle physics. This huge data sample serves as a perfect testing ground for machine learning algorithms that deal with pattern recognition, sequence analysis, classification, and regression. Students will work with researchers at Berkeley Lab to develop machine-learning-based techniques to improve the identification and reconstruction of particles produced in high-energy collisions, in order to enhance the sensitivity of the experiment for discovery. Scientific publications may be produced as a result of this project.

Charter Schools and the Business Age

UC Berkeley Sociology Department

How does the push to run schools like businesses--complete with performance targets, incentives, and centralization in culture and governance--shape the growing charter school sector? Which charters survive and thrive in this political climate: those that stress standards-based rigor and college-readiness (traditional model), or those that prioritize independent thinking and socio-emotional development (progressive model)? And how does this differentiation affect charter school segregation--that is, do progressive schools serve white students in affluent, liberal communities while traditional schools serve students of color in poor or conservative communities?

Computational analysis of academic texts

UC Berkeley Sociology Department

Professor Heather Haveman and doctoral candidate Jaren Haber are analyzing over 3 million articles gathered from JSTOR, the leading online repository of journal articles for the social sciences. Research in many academic fields requires reviewing the literature to document what we know about a phenomenon and what gaps remain in our knowledge. The number of journals and articles published has increased enormously over the past 20 years, making it increasingly difficult for scholars to keep up with the literature. We are developing a flexible and reproducible method for reviewing academic literature that takes advantage of massive online collections containing nearly all articles published in academic journals. The goal is to harness computers to review the entire corpus of published literature by charting engagement with specific theories or topics over time and across subfields. This computational method stands in sharp contrast to the time-honored practice of human reading, which can cover only a small fraction of the published corpus.

Let's Make it COUNT

West Big Data Innovation Hub

Building upon the U.S. Census Bureau's Statistics in Schools program, which reaches more than 56 million public school students, this project promotes awareness of the first-ever digital count of the United States — the 2020 Census. Participants are developing interactive Jupyter notebooks and data visualizations in collaboration with the National Science Foundation West Big Data Innovation Hub, the Lawrence Hall of Science, and a network of advisors across the country.

Census Research

West Big Data Innovation Hub

Using census data, this collaborative aims to provide insights regarding neighborhood wealth and inequality levels across the United States.

Water Data Collaborative

California State Water Resources Control Board

The California State Water Resources Control Board seeks to create comprehensive, high-quality data on water rates for agencies across the state. This will enable easy interoperability when joining water use, income, and other demographic data sets.

Data Enabled Donations

West Big Data Innovation Hub

This project, a collaboration between the West Big Data Innovation Hub and the State of California, addresses how we can better respond to natural disasters. It aims to find ways to optimize the supply and demand of the physical donations that shelters receive following a disaster.

Fall Armyworm

Mercy Corps

Currently, over 300 million Sub-Saharan Africans depend on maize as their primary staple food crop. In 2016, a pest indigenous to the Americas called the Fall Armyworm (FAW) arrived in West Africa and has been spreading at breathtaking speed, attacking maize in 44 African countries in just one year. Students will help Mercy Corps create a tool that indicates the location and intensity of FAW outbreaks and identifies the risk of future outbreaks, enabling humanitarian organizations to target their resources effectively.

Remote Sensing for Indicators of Health in Smallholder Crops

Mercy Corps

In Africa alone, smallholder farmers are 33 million strong and represent 90% of all food production. Healthier crops translate to higher yields. Higher yields result in better socio-economic, health and food security outcomes. Mercy Corps operates several large agriculture programs across Africa because of the critical link between agricultural productivity and development. As such, Mercy Corps is partnering with UC Berkeley's Division of Data Science and Information to seek new and innovative ways that involve machine learning to help smallholder farmers strengthen their resiliency in the face of dynamic threats like climate change, pests and diseases, and disasters.

Applying machine learning to assess neurological health utilizing eye motion

C. Light Technologies

C. Light is a neurotech and AI company that uses precisely measured eye motion to quickly and objectively assess neurological health. We're first applying this to improve the clinical care of patients with multiple sclerosis (MS). MS is unique among neurological disorders in that there are over a dozen therapeutics currently on the market. However, with the clinical tools available, it currently takes more than two years to detect that a medication is ineffective for a particular patient. C. Light hopes to shorten this process, getting patients on the best medications sooner, empowering doctors to make medication decisions, reducing costs to the healthcare system, and improving patient outcomes. Leveraging advanced data science is core to our value proposition and will enable us to draw the most powerful conclusions from our unique data.

Advanced statistical modeling for concussion recovery

C. Light Technologies

This project analyzes annotated eye motion data recorded from concussion patients and controls. It will involve data visualization and aims to classify the two groups using either the annotated features or the raw time-series eye motion traces.

Anomaly detection in solar microgrids

New Sun Road

New Sun Road's anomaly detection project is developing machine learning tools to highlight unusual electrical events at solar microgrids, with applications in predictive maintenance, theft detection, and beyond.

Data processing and management for Advanced Batteries

Coreshell Technologies

Coreshell Technologies is developing advanced coating materials to enable the next generation of rechargeable batteries. This project aims to use data science to better process and analyze data collected from Coreshell's batteries and to expedite the discovery of a scalable solution for improving energy storage in electric vehicle and renewable energy applications.

Lifestyle Banking


Upswot provides real-time tracking of the crucial positive and negative events SMEs face, based on non-traditional data sources. We track their issues, needs, and preferences to help lenders and insurers cross-sell more, better, and in a highly automated way.

Calculating Corporate Culture and Performance


Elin.ai aims to build the future of work, where distributed teams working remotely are united under one culture. Elin.ai helps companies decode their corporate DNA by measuring culture health through tracking patterns in their communication channels. Elin.ai's next step is to calculate culture ROI (culture health mapped to corporate performance).

Dieta Analysis


Dieta aims to explore the relationship between the foods people consume and their resulting digestive health. In this project, data from the Dieta mobile app will be analyzed to learn these relationships.

Predictive Autoimmune Diagnosis


ad.ai's software will process disparate data sources (e.g., claims, genomics, and user-generated data) to better understand and predict whether a patient will go on to develop an autoimmune disease.

New perspectives on gender diversity

Economics, UC Berkeley

The Gender Diversity Project employs data science tools to identify bottlenecks in women's career paths. Using a novel data source covering employees' career paths within a very large company over more than a decade, the project aims to detect patterns in the data that can tell us when employees experience systematic roadblocks in their career development. The goal is first to empirically identify the factors driving these patterns, for instance whether bottlenecks exist due to a lack of qualified applicants or due to distorted selection rules, and second to test which changes to human resources practices can help promote diversity.

Internet Interoperability Index

Center for Long-Term Cybersecurity

Governments, firms, and citizens are again debating to what extent the Internet is, or should be, a “global” infrastructure for communication and commerce—the “end of geography and death of distance” on one hand, and on the other, phrases like the “splinternet.” Our work brings measurement to this debate. We will create an “Internet Interoperability Index,” a set of proxy measures which, in aggregate, will allow us and others to measure whether and to what extent the internet is becoming more global and uniformly-accessible, or more nationalized and siloed.

Exploring automatic, language-independent phonetic transcription

Linguistics, UC Berkeley

Speech recognition systems are largely developed on a language-by-language basis and require high volumes of material for a given language to train and test a model. This method is prohibitive for many languages, especially those that lack the large corpora of data required to train these models. The goal of this project is to improve speech recognition technology by exploring ways to automatically and systematically transcribe language sounds, creating a textual representation that can be fed into further models.

Machine learning to understand ecosystems and the global carbon cycle

Environmental Science & Policy and Management, UC Berkeley

Safe Water, Healthy Communities: Dallas Stream Water Quality Analyses

City of Dallas

This project includes data analysis, trend detection, and mapping to use existing data effectively to direct stormwater outreach and enforcement toward improved water quality across the City. It also develops a tool that allows the public to better understand water quality conditions across the City.

Exploring the Achievement Gap in Berkeley Public Schools

Division of Data Science and Information, UC Berkeley

This project will use publicly available state testing data to make informed school district policy decisions. The goal is to make easily understandable analysis available to different stakeholders from data that is currently publicly available but difficult to access.  

Foundations and Asian American and Latino Civil Society

Political Science, UC Berkeley

This project aims to explore how foundations have influenced the agendas of Asian American and Latino civil society by funding particular types of organizations over others.

Understanding Flight Performance Data


This project investigates the flight behavior of airline pilots and the factors affecting that behavior in a series of challenging simulated flights. Our study produces a large, heterogeneous data set including a variety of written records, a large corpus of simulator log files, and associated eye-tracking data. We want to compare pilot activity on different flights on a variety of measures, such as the time between two events (e.g., an initiating challenge and the following response) and the frequency of particular patterns of behavior. The project will involve identifying, coding, and analyzing the occurrence and timing of behaviors, and linking behaviors across data types (captured by the simulator, by the eye tracker, or by manual annotation). Relevant interests and skills for the student include time-series data, management of large data sets, statistics, eye tracking, attention, and aviation. The analysis will be multi-faceted and will allow some flexibility to design the project to capitalize on the student's specific skills and interests.
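
One of the measures named above, the time between an initiating challenge and its following response, could be computed from a simulator log along these lines (the event names and log format here are invented for illustration, not the study's actual schema):

```python
# Sketch: latency from each "challenge" event to the next "response" event
# in a timestamped event log.

from datetime import datetime

log = [
    ("2019-09-01 10:00:00", "engine_fire_warning"),   # initiating challenge
    ("2019-09-01 10:00:04", "checklist_opened"),      # following response
    ("2019-09-01 10:05:30", "stall_warning"),
    ("2019-09-01 10:05:32", "pitch_down_input"),
]

def response_latencies(events, challenge, response):
    """Seconds from each `challenge` event to the next `response` event."""
    latencies = []
    pending = None
    for ts, name in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if name == challenge:
            pending = t
        elif name == response and pending is not None:
            latencies.append((t - pending).total_seconds())
            pending = None
    return latencies

print(response_latencies(log, "engine_fire_warning", "checklist_opened"))  # → [4.0]
```

The same pairing logic extends to pattern frequency counts, and to linking simulator events with eye-tracker samples by timestamp.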

Sumerian Networks

Near Eastern Studies, UC Berkeley

Administrative documents from the city of Puzriš-Dagan (modern Drehem, southern Iraq), written in Sumerian cuneiform on clay tablets more than 4,000 years old, reveal a dense network of officials, members of the royal family, foreigners, and gods. Who are the central people in this network, what cliques can we identify, and who are the bridges between those cliques? This research will attempt to answer these questions.
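
These questions map onto standard graph measures. The toy sketch below uses an invented miniature graph (the names and edges are illustrative, not the tablet data; the real analysis would use a library such as networkx on the full network) to compute degree centrality and to find "bridge" individuals whose removal disconnects the network:

```python
# Toy social-network analysis: degree centrality and bridge detection
# on a small hand-made graph of hypothetical individuals.

from collections import defaultdict, deque

edges = [  # invented co-occurrence links between individuals
    ("Abba-saga", "Intaea"), ("Abba-saga", "Nalu"), ("Intaea", "Nalu"),
    ("Abba-saga", "Babati"),
    ("Abba-saga", "Ur-kununa"),            # Ur-kununa links the two cliques
    ("Ur-kununa", "Lu-digira"), ("Lu-digira", "Shulgi"), ("Ur-kununa", "Shulgi"),
]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def degree_centrality(g):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(g)
    return {v: len(nbrs) / (n - 1) for v, nbrs in g.items()}

def components(g, removed=frozenset()):
    """Count connected components, optionally ignoring some nodes."""
    seen, count = set(removed), 0
    for start in g:
        if start in seen:
            continue
        count += 1
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if v in seen:
                continue
            seen.add(v)
            queue.extend(g[v] - seen)
    return count

dc = degree_centrality(graph)
most_central = max(dc, key=dc.get)

# A "bridge" person is one whose removal breaks the network apart.
base = components(graph)
bridges = [v for v in graph if components(graph, removed={v}) > base]
```

On this toy graph the bridge detection singles out exactly the individuals who connect otherwise separate groups; betweenness centrality and clique enumeration would refine the same idea at scale.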

Ancient World Citation Analysis

Digital Humanities / D-Lab / Near Eastern Studies, UC Berkeley

The goal is to build a citation network from a 2 TB collection of documents from the fields of ancient Near Eastern Studies, Classics, Archaeology, and Middle Eastern Languages. This project will make the collection more internationally accessible for research by scholars in these fields. First, we will run OCR in batches on Savio, a computing cluster available through Research IT. Next, we will data-mine the results (using NER methods) for both citation analysis and bibliographic analysis. The result will be a multi-modal network of authors-to-authors (i.e., who cites whom?) and authors-to-primary-sources (i.e., who cites what?), with links to the OCRed text as 'bags of words' (to avoid copyright issues). We will also introduce tools and methods for textual analysis (e.g., Gensim).
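
The "who cites whom?" step might look like the following sketch, where a simple "Author (Year)" regex stands in for the NER methods mentioned above; the text snippets and author names are invented for illustration:

```python
# Sketch: mine OCRed text for author-year citations and accumulate a
# weighted author-to-author edge list.

import re
from collections import Counter

ocr_pages = {  # hypothetical OCR output keyed by the citing author
    "Smith": "As argued by Jacobsen (1976) and again by Civil (1994), ...",
    "Jones": "Following Civil (1994), the tablets suggest ...",
}

CITATION = re.compile(r"([A-Z][a-z]+)\s+\((\d{4})\)")

edges = Counter()
for citing, text in ocr_pages.items():
    for cited, year in CITATION.findall(text):
        edges[(citing, cited)] += 1     # who cites whom, and how often
```

A real pipeline would replace the regex with trained NER, resolve name variants, and add the author-to-primary-source layer, but the edge-list accumulation is the same.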

Improving CalAnswers

Vice Chancellor of Finance

We want to build an interactive ad-hoc BI reporting solution where users can get high-level data with a simple search query against the data source (Oracle database, flat file). The main aim of this application is to simplify the way users build ad-hoc queries using keywords (similar to Twitter hashtags) such as Year, Month, Department, Employee Count, Average Years of Service, Enrollment, Revenue, and Budget. On the application's search/query page, we want to display example keywords and canned queries.
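
One possible shape for the keyword-to-query mapping is sketched below; all table and column names are invented for illustration, not the actual Cal Answers schema:

```python
# Sketch: map recognized keywords to dimensions/measures and assemble a
# canned SQL query, grouping by whatever dimensions the user picked.

KEYWORD_MAP = {  # keyword -> (SQL expression, kind)
    "year": ("fiscal_year", "dimension"),
    "department": ("department_name", "dimension"),
    "employee count": ("COUNT(employee_id)", "measure"),
    "revenue": ("SUM(revenue)", "measure"),
}

def build_query(keywords, table="hr_facts"):
    dims = [KEYWORD_MAP[k][0] for k in keywords if KEYWORD_MAP[k][1] == "dimension"]
    meas = [KEYWORD_MAP[k][0] for k in keywords if KEYWORD_MAP[k][1] == "measure"]
    sql = f"SELECT {', '.join(dims + meas)} FROM {table}"
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql

print(build_query(["year", "department", "employee count"]))
```

A production version would add keyword normalization, validation against the data source's metadata, and parameterized execution rather than string assembly.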

GoodlyLabs Data Generation

Goodly Labs

A model is only as good as its dataset. Students will help parse, clean, and tag datasets for use in Public Editor, Demo Watch, and Liberating Archives.

Public Editor

Goodly Labs

The citizen science solution to media misinformation: students on this project will help develop and test a collaborative web app guiding thousands of internet volunteers to read through the most-shared news articles and find evidence of misinformation in the content. Working with a national coalition of social science researchers and journalists, a Nobel Laureate, cognitive scientists, and software designers and developers, students will see first-hand how the social-good technologies of the future get built. And students' individual efforts will be essential to the creation of a world where the public has confidence in its capacity to discern truth from fiction and fact from opinion.

Demo Watch

Goodly Labs

Discovering patterns of peace and violence between police and protesters: students on this project will use a collaborative web app to process the information in over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. Data processed by students will be used to find patterns of peace and violence, which can scaffold broad public conversations and shift the behavior of police and protest strategists. Students' work will also be used to create artificial intelligence able to understand dynamics between cities and movements and recommend policies more likely to result in peaceful and effective political expression.

Liberating Archives

Goodly Labs

Using the Government Publishing Office’s Congressional Hearing Transcripts as an example, the project leads will guide participants through their own data liberation project, in which they will: scrape the web for document files while retaining document metadata; programmatically find and extract meaningful data objects within the documents; link those objects to external databases; prepare all this compiled textual data for computational analysis in R and Python; and host their newly formed database so that the public and other researchers can launch their own studies of the data.