Spring 2019 Discovery Projects

Humanitarian Data Exchange

UN Office for the Coordination of Humanitarian Affairs

UN OCHA (Office for the Coordination of Humanitarian Affairs) runs the Humanitarian Data Exchange, based in The Hague, where humanitarian agencies can upload their data sets for collaboration and sharing. Microsoft Research is working with the UN to guide a team in creating a process or resource for applying HXL data tags to all 5,000 data sets to improve data analysis.
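As a rough illustration of what tagging a data set could involve, the sketch below inserts a row of HXL hashtags beneath a CSV header row. The header-to-hashtag mapping and the sample rows are made up for this example, and any real mapping would need to be checked against the HXL standard. It is not the project's actual process.

```python
import csv
import io

# Illustrative mapping from column headers to HXL hashtags (an assumption,
# not a vetted mapping); unknown headers are left untagged.
HEADER_TO_TAG = {
    "country": "#country",
    "province": "#adm1",
    "organisation": "#org",
    "sector": "#sector",
    "people affected": "#affected",
    "date": "#date",
}

def add_hxl_row(csv_text):
    """Return the CSV text with a hashtag row inserted under the header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    header = rows[0]
    tags = [HEADER_TO_TAG.get(h.strip().lower(), "") for h in header]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerow(tags)
    writer.writerows(rows[1:])
    return out.getvalue()

# Made-up sample data for illustration only.
sample = "Country,Province,People affected,Notes\nKenya,Turkana,1200,flooding\n"
tagged = add_hxl_row(sample)
print(tagged)
```

Columns with no known hashtag get an empty cell, which makes them easy to flag for manual review.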

Bay Area Trip Choice Model


Driving creates many externalities: time lost in traffic, climate emissions, local air pollution, and more. This project aims to aid policymakers working to limit these ills and improve travel in the Bay Area, particularly through pricing policies such as adjustments to tolls and parking rates. We aim to build a simplified commute model that simulates how people get to work. The options (driving, taking transit, etc.) and the monetary and time cost of each can be constructed through a mapping API.
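One common way to turn per-mode costs into commute shares is a multinomial logit model. The sketch below assumes that approach, which the project description does not specify. The value of time, the scale parameter, and the sample costs are all hypothetical.

```python
import math

def mode_shares(options, value_of_time=20.0, scale=0.1):
    """Return the probability of choosing each mode under a multinomial logit.

    options: dict mapping mode name -> (money cost in dollars, travel time in minutes)
    value_of_time: assumed dollars per hour used to convert time into money
    scale: assumed sensitivity of travelers to generalized cost
    """
    costs = {
        mode: money + value_of_time * minutes / 60.0
        for mode, (money, minutes) in options.items()
    }
    # Subtract the minimum cost before exponentiating for numerical stability.
    base = min(costs.values())
    weights = {mode: math.exp(-scale * (c - base)) for mode, c in costs.items()}
    total = sum(weights.values())
    return {mode: w / total for mode, w in weights.items()}

# Hypothetical costs for one commuter; real values would come from a mapping API.
commute = {"drive": (12.0, 35), "transit": (4.5, 55), "bike": (0.0, 70)}
before = mode_shares(commute)

# Hypothetical pricing policy: a $6 toll added to the cost of driving.
tolled = dict(commute, drive=(18.0, 35))
after = mode_shares(tolled)

print(before)
print(after)
```

Comparing the two results shows how a pricing change shifts commuters away from driving under these assumptions.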

Fall Armyworm

Mercy Corps

Currently, over 300 million Sub-Saharan Africans depend on maize as their primary staple food crop. In 2016, a pest indigenous to the Americas called the Fall Armyworm (FAW) arrived in West Africa and has been spreading at breathtaking speed, attacking maize in 44 African countries in just one year. Students will help Mercy Corps create a tool that indicates the location and intensity of FAW outbreaks and helps identify the risk of future outbreaks, enabling humanitarian organizations to target resources effectively.

Economics of Disaster Response

West Big Data Innovation Hub

Students will work with the West Big Data Innovation Hub and the State of California to address how we can better respond to natural disasters. Goals include finding ways to better match the supply of physical donations to demand and understanding the economics of disaster prevention and relief.



BEACO2N is a network of over 100 air quality and climate sensors. BEACO2N is undertaking projects on everything from understanding instrument calibration, to describing the composition of local plumes, to characterizing connections between weather and the observations, to visualization of the observations. 

Water Data Collaborative

California State Water Resources Control Board

California State Water Resources Control Board seeks to create comprehensive, high-quality data on water rates for agencies across the state. This will allow easy interoperability in joining water use, income, and other demographic data sets.

Work at Home Vintage Experts


WAHVE pairs companies looking for specific skills with the veteran talent who have them. Businesses get the quality and knowledge they need, while “vintage” professionals (people over 50) get to phase into retirement working from their home office. We have analyzed over 52,692 unique applicants and are looking for students to participate in NLP analysis of the data.

Building damage detection from Satellite Images

Berkeley Seismology Lab

Satellite images and other remote sensing data provide a way to evaluate damage quickly after hazards such as earthquakes, and more and more of this data is being made available to the public. The motivation behind this project is to identify damaged buildings and other types of infrastructure from satellite images or other remote sensing data after an earthquake. Students will work with a researcher at the Berkeley Seismology Lab to develop machine learning models for object detection or classification that identify damaged buildings or other critical information after earthquakes.

Text analysis of Campaign Messages in French Elections

Department of Economics

This project investigates the extent to which politicians adapt their discourse to electoral competition. We combine a new dataset of 30,000 manifestos issued by individual candidates in French legislative elections between 1958 and 1993 with computational text analysis to measure changes in discourse over the campaign, in particular between election rounds.

Charter Schools and the Business Age

UC Berkeley Sociology Department

How does the push to run schools like businesses--complete with performance targets, incentives, and centralization in culture and governance--shape the growing charter school sector? Which charters survive and thrive in this political climate: those that stress standards-based rigor and college-readiness (traditional model), or those that prioritize independent thinking and socio-emotional development (progressive model)? And how does this differentiation affect charter school segregation--that is, do progressive schools serve white students in affluent, liberal communities while traditional schools serve students of color in poor or conservative communities?

Improving CalAnswers

Vice Chancellor of Finance

We want to build an interactive ad-hoc BI reporting solution where users can get high-level data with a simple search query against the data source (Oracle database, flat file). The main aim of this application will be to simplify the way users build ad-hoc queries by using keywords (similar to Twitter hashtags) such as Year, Month, Department, Employee Count, Average Years of Service, Enrollment, Revenue, Budget, etc. On the application's search/query page, we want to display some example keywords and canned queries.
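To make the keyword idea concrete, here is a minimal sketch of translating hashtag-style keywords into a SQL query. It uses an in-memory SQLite table as a stand-in for the Oracle data source. The table name, column names, keyword mapping, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical mapping from search keywords to grouping columns and aggregates.
DIMENSIONS = {"year": "year", "department": "department"}
MEASURES = {
    "employeecount": "SUM(employee_count)",
    "budget": "SUM(budget)",
}

def build_query(search, table="hr_summary"):
    """Translate a search such as '#Department #Budget' into a SQL statement."""
    keywords = [word.lstrip("#").lower() for word in search.split()]
    unknown = [k for k in keywords if k not in DIMENSIONS and k not in MEASURES]
    if unknown:
        raise ValueError(f"unknown keywords: {unknown}")
    dims = [DIMENSIONS[k] for k in keywords if k in DIMENSIONS]
    meas = [MEASURES[k] for k in keywords if k in MEASURES]
    if not dims and not meas:
        raise ValueError("search contains no keywords")
    # Only whitelisted column expressions ever reach the SQL text.
    sql = f"SELECT {', '.join(dims + meas)} FROM {table}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)} ORDER BY {', '.join(dims)}"
    return sql

# Made-up sample data standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hr_summary "
    "(year INTEGER, department TEXT, employee_count INTEGER, budget REAL)"
)
conn.executemany(
    "INSERT INTO hr_summary VALUES (?, ?, ?, ?)",
    [
        (2018, "Physics", 40, 1500000.0),
        (2018, "History", 25, 800000.0),
        (2019, "Physics", 42, 1600000.0),
    ],
)

query = build_query("#Department #EmployeeCount")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Restricting keywords to a fixed whitelist keeps user input out of the SQL text, at the cost of having to maintain the mapping by hand.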

Ancient World Citation Analysis

Digital Humanities / D-Lab / Near Eastern Studies

The goal is to build a citation network from a 2TB collection of documents from the fields of ancient Near Eastern Studies, Classics, Archaeology, and Middle Eastern Languages. The result of this project will make this collection more internationally accessible for research by scholars in these fields. First, we will run OCR in batches using Savio, a computing cluster available through Research IT. Next, we will data-mine the results (using NER methods) for both citation analysis and bibliographic analysis. The result will be a multi-modal network of authors-to-authors (i.e. who cites whom?) and authors-to-primary sources (i.e. who cites what?), with links to the OCRed text in 'bags of words' (to avoid copyright issues). We will also introduce tools and methods for textual analysis (e.g. Gensim).
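The two kinds of links described above can be sketched with plain Python dictionaries. The record format below is a hypothetical stand-in for whatever the OCR and NER steps actually produce, and the author and source names are placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical output of the OCR + NER step: one record per document.
records = [
    {"author": "Author A", "cited_authors": ["Author B", "Author C"], "cited_sources": ["Source X"]},
    {"author": "Author B", "cited_authors": ["Author C"], "cited_sources": ["Source X", "Source Y"]},
    {"author": "Author C", "cited_authors": [], "cited_sources": ["Source Y"]},
]

def build_network(records):
    """Build author-to-author and author-to-primary-source edge sets."""
    who_cites_whom = defaultdict(set)
    who_cites_what = defaultdict(set)
    for rec in records:
        who_cites_whom[rec["author"]].update(rec["cited_authors"])
        who_cites_what[rec["author"]].update(rec["cited_sources"])
    return who_cites_whom, who_cites_what

def citation_counts(edges):
    """Count how many distinct authors cite each target."""
    counts = Counter()
    for targets in edges.values():
        counts.update(targets)
    return counts

who_cites_whom, who_cites_what = build_network(records)
print(citation_counts(who_cites_whom).most_common(1))
print(citation_counts(who_cites_what))
```

A graph library could replace the dictionaries once the network grows, but the two edge types would stay the same.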

Public Editor

Goodly Labs

The Citizen Science Solution to Media Misinformation. Students on this project will help develop and test a collaborative web app guiding thousands of internet volunteers to read through the most shared news articles and find evidence of misinformation in the content. Working with a national coalition of social science researchers and journalists, a Nobel Laureate, cognitive scientists, and software designers/developers, students will see firsthand how the social good technologies of the future get built. And students’ individual efforts will be essential to the creation of a world where the public has confidence in its capacity to discern truth from fiction and fact from opinion.

Liberating Archives

Goodly Labs

Using the Government Publishing Office’s Congressional Hearing Transcripts as an example, the project leads will guide participants through their own data liberation project, in which they will: scrape the web for document files while retaining document metadata; programmatically find and extract meaningful data objects within the documents; link those objects to external databases; prepare all this compiled textual data for computational analysis in R and Python; and host their newly formed database so that the public and other researchers can launch their own studies of the data.

Deciding Force

Goodly Labs

Discovering Patterns of Peace and Violence between Police and Protesters. Students on this project will use a collaborative web app to process the information in over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. Data processed by students will be used to find patterns of peace and violence, which can be used to scaffold broad public conversations and shift the behavior of police and protest strategists. Students’ work will be used to create artificial intelligence able to understand dynamics between cities and movements and recommend policies more likely to result in peaceful and effective political expression.


East Bay Ophthalmology

Glaucoma is estimated to affect 3 million people in the United States alone, and is the leading cause of preventable, irreversible blindness worldwide. Although glaucoma is a largely treatable disease, poor medication compliance occurs in an estimated 30% to 60% of patients. Ocuelar uses an innovative smartphone application to increase patient medication compliance through a Bluetooth smart cap that is able to automatically record when a patient takes their glaucoma eye drops. This unique opportunity at the intersection of health and data science will allow students to analyze the patient datasets from our preliminary study conducted at East Bay Ophthalmology to find creative trends that could directly impact patient care.


East Bay Ophthalmology

iCare is a humanitarian organization with the mission of providing ophthalmic and oculoplastic care to those that have the least access to it. Since 2006, iCare has partnered with local ophthalmologists to advance ophthalmic and oculoplastic surgical care in over 20 countries. For this project, students will conduct a retrospective analysis of patient health outcome data from Macedonia and China to quantify the efficacy of the program.

Analyzing Big Data from the Centers for Medicare & Medicaid Services

East Bay Ophthalmology

In the last three decades, big data has been applied to diverse fields such as government, international development, and education. Only now has the US healthcare system begun to explore its underutilized data. Under the supervision of Dr. Scott Lee, students will analyze the data available from the Centers for Medicare & Medicaid Services to find innovative ways to understand clinician decision-making in today’s healthcare system.

Traffic, Mobility & Sustainability in Megalopolis

California Institute for Energy and Environment

As cities continue to grow, so do the challenges and complexities of moving people around and the environmental impact of combustion-based vehicle fleets. In this project, you will work with large datasets from cities and their traffic patterns. We expect to implement learning algorithms to better understand urban environments, how people move, and approaches to reducing the environmental impact.

Analysis of the Electric Field in Near-Earth Space

Space Sciences Laboratory

The electric field plays a fundamental role in space. Yet, it remains poorly understood. The objective of the project is to analyze a large database of electric field measurements recently provided by the two spacecraft of the Van Allen Probes mission. The project aims to identify trends in the data, and ultimately to formulate a simple analytical model.

Neural Networks for Irregular Time Series: Online Machine Learning for Streaming Prediction

Lawrence Berkeley National Laboratory

As part of the effort to evaluate the potential of neural networks for scientific applications, we are exploring the effectiveness of neural networks for predicting strongly irregular time series. The goal is to understand the limitations of current neural network designs and the best ways to train neural networks on streaming data. Initially, this exploration will be performed with CPU/GPU software. Eventually, we anticipate porting the best neural networks to an ML hardware system.

Active learning on Chemical and Material Systems

Lawrence Berkeley National Laboratory

Conventional machine learning algorithms operate on fixed-length real-valued vectors, while real-world objects can often be modeled more efficiently as discrete objects containing non-linear structures and categorical attributes. As an example, a molecule is often visualized as a collection of atoms and bonds that fits naturally into a graph-based data structure. In this project, we seek to apply and advance recently introduced learning techniques such as graph kernels, message-passing neural networks, and graph convolutions to construct models that can learn from chemical and material datasets encoded as graphs. The project will expand on multiple fronts, including algorithm design, software implementation and optimization, and application of existing algorithms to real-world scientific problems.
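To illustrate the graph encoding, the sketch below represents a water molecule as atoms (nodes) and bonds (edges) and runs one round of sum-aggregation message passing. This is a deliberately simplified, weight-free version of the idea behind message-passing neural networks, not the project's method. The feature encoding is an assumption chosen for brevity.

```python
# A water molecule as a graph: atoms are nodes, bonds are undirected edges.
atoms = ["O", "H", "H"]
bonds = [(0, 1), (0, 2)]

# Assumed feature encoding: one-hot over the elements present.
ELEMENTS = ["H", "O"]

def one_hot(symbol):
    return [1.0 if symbol == element else 0.0 for element in ELEMENTS]

def adjacency(num_atoms, bonds):
    """Return a neighbor list for each atom."""
    neighbors = [[] for _ in range(num_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors

def message_pass(features, neighbors):
    """One round: each atom adds its neighbors' features to its own."""
    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in neighbors[i]:
            total = [a + b for a, b in zip(total, features[j])]
        updated.append(total)
    return updated

features = [one_hot(a) for a in atoms]
neighbors = adjacency(len(atoms), bonds)
features_1 = message_pass(features, neighbors)

# Sum over atoms to get a fixed-length vector for the whole molecule.
readout = [sum(column) for column in zip(*features_1)]
print(features_1)
print(readout)
```

After one round, each atom's vector counts the elements in its immediate neighborhood, so the two hydrogens end up with identical features. A learned model would replace the plain sums with trainable update functions.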

Machine learning based event classification for Higgs Boson property measurements

Lawrence Berkeley National Laboratory

The ATLAS experiment at the Large Hadron Collider collects an unprecedented amount of data, which enables the discovery of new physics beyond the established Standard Model of particle physics. The huge data sample serves as a perfect testing ground for machine learning algorithms that deal with pattern recognition, sequence analysis, classification, and regression. Students will work with researchers at Berkeley Lab to develop machine-learning-based techniques that improve the identification and reconstruction of particles produced in high-energy collisions, in order to enhance the sensitivity of the experiment for discovery. Scientific publications may be produced as a result of this project.