Fall 2020 Discovery Projects

Lawrence Berkeley National Lab - Modeling Access Pattern of Large Scientific Data Repository with Machine Learning

Large amounts of data are archived on the tape archive (HPSS), especially for experiments that produce large volumes of data every year.

Berkeley Police Department - Simulating Alternative Responses to Calls for Service

The Berkeley City Council recently voted to audit the "calls for service" (CFS) received by the Berkeley Police Department (BPD) to determine the feasibility of transferring the response to certain types of calls to alternative emergency response

UC Berkeley School of Law - Indigenous Brands and Social Movements

Have you noticed how many brands in the marketplace use indigenous/Native American-oriented terms and imagery?

Exploring the Achievement Gap in Berkeley Public Schools

This project works with state and district data to examine achievement gaps in Berkeley Public Schools.

Building Integrated Solar Photovoltaics Assessment via ML

With the advent of novel image classification algorithms, the opportunities to apply them to the energy space are increasing, too.

DaanMatch - A decision-making tool/method to facilitate the correlation of NGO and corporation datasets for efficient fund/aid allocation

DaanMatch is a project to address social inequalities in fund distribution in India. Recent data shows that around 60% of organizing funding is allocated to projects in urban areas and that the main beneficiaries are a few multinational NGOs.

California Partners for Advanced Transportation Technology (PATH) - Erroneous High Occupancy Vehicle (HOV) Degradation

In a recent review of Performance Measurement System (PeMS) data quality for the Connected Corridors project along the I-210 corridor, it was discovered that almost 10% of HOV loops along 30 miles of freeway were actually located in the mainline.

NHERI SimCenter - Enhancing regional scale natural hazard simulations with artificial intelligence

The NHERI Computational Modeling and Simulation Center (SimCenter) provides next-generation computational modeling and simulation software tools, user support, and educational materials to the natural hazards engineering research community with th

SF Chronicle - Disaster Maps

The San Francisco Chronicle has produced a well-known Fire Tracker for several years, using data from NOAA as well as NASA's MODIS and VIIRS-I satellites. We are also working on a Flood Tracker for the Houston Chronicle.

Berkeley Law Center for Law, Energy & the Environment - Analyzing Research on the Environmental & Energy Impacts of the Digital Economy

This project will center around analyzing a database of sources for an emerging multidisciplinary field of practice in hopes of gaining helpful insight into the field’s major trends, focus areas, and trajectory.

United Ways of California

We would like to be able to build a mechanism that can collect data from photos of various documents and use that data to automatically fill out applications for COVID-19 relief and ongoing safety-net programs.

Creative Commons - Image Clustering and metric inference

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons which currently indexes metadata for around 500 million images.

Creative Commons - Image Popularity and Authority

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons which currently indexes metadata for around 500 million images.

Creative Commons - Linked Commons Graph Analysis

Creative Commons has collected graph data linking different web properties that use Creative Commons licenses. Nodes are domains (rather than, e.g., individual pages).

Goodly Labs - Public Editor

Students on this project will analyze data from a collaborative web app guiding thousands of internet volunteers to read through the most shared news articles of the day and label evidence of misinformation in the content.

Goodly Labs - Research Ready Government Archives

The Research Ready project seeks two students to help improve and maintain archives of government activity that researchers, journalists, and the public can easily query.

Goodly Labs - Demo Watch

The Demo Watch project has collected and is curating over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement.

East Bay Community Energy - Evaluating Alameda County CO2 Emissions and Optimizing Customer Programs Using Marginal Emissions Data

East Bay Community Energy’s goal is to provide cleaner and cheaper energy to East Bay Cities and customers. The goal of this project is to answer the following questions: 1.

Bay Area Rapid Transit - Reduce service outages of BART

Due to the drop in ridership from the COVID-19 pandemic, BART's operating budget has been drastically reduced.

City of Paterson - Mapping Eviction Trends in the City of Paterson

With the unemployment rate at historic highs and eviction moratoriums expiring across the United States, tens of thousands of renters are expected to face homelessness.

Cal Alumni Association - CAA Data Analysis Project

This project seeks to use data science to offer insights that could improve CAA’s operational efficiency.

Girl Effect - Using NLU to derive evaluation results from qualitative in-product feedback

Girl Effect builds products for girls in developing markets. We work in Rwanda, India, Ethiopia, South Africa and Tanzania.

Girl Effect - Speaking her Language: Using NLU to build conversational products for girls in developing countries

Girl Effect builds products for girls in developing markets. We work in Rwanda, India, Ethiopia, South Africa and Tanzania.

UCSF - A Precision Medicine Recommender System for Inflammatory Bowel Disease: a pilot study using real electronic health records data from UCSF

The idea of Precision Medicine is to move beyond our one-size-fits-all healthcare system and towards one where we make data-driven treatment decisions that take into account individual factors, like patient demographics, genetic background, and me

UC Berkeley, Biophysics - Identification and Classification of Intrinsically Disordered Regions in Proteins

Regions within proteins can be broadly classified into two types: ordered and disordered. Ordered regions assume a defined three-dimensional structure and are identified by their unique sequence of amino acids.

BioXplor Inc - Data Visualisation for Covid-19 Free online Platform

Design and build a modern dashboard that summarizes various Covid-19 ontologies powered by biomedical literature (gene, drug, disease, source, author, applications, gene mutations, etc.).

School of Information - Evaluating Accessibility on Congressional Websites

Voters and constituents increasingly look to the websites of elected officials to answer their questions about voting or constituent services.

Using AI and real-time data to power an early warning system for public safety and the recognition of emotions

We will build an AI system to detect negative emotion from vocalizations recorded by body cameras worn by police (building upon several published papers in my lab) and develop a warning system for police officers that informs them of when t

ESPM - Addressing Structural Inequality in U.S. Agricultural Higher Education: An Assessment of Pedagogical Practices and Food Systems Coursework at Land-grant Institutions

In the last decade, many U.S. “land-grant” agricultural universities (including UC Berkeley) have turned a reflexive lens on the fact that their institutions marginalize certain community members.

Lawrence Hall of Science - Building Data Science Apps to teach Data Science

The project goal will be to develop a number of digital applications designed to help people of all ages learn about different data science concepts, like statistics and ML algorithms, by providing learners with intuitive examples and interactions

Mapping for Environmental Justice

Mapping for Environmental Justice (MEJ) creates easy-to-use, publicly available maps that paint a holistic picture of intersecting environmental, social, and health impacts experienced by communities across the US.

Sumerian Networks

The goal of the Sumerian Network project has been to build reproducible socio-economic networks from the Ur III textual archives.

The Tempest Media - AI Content Management System for The Tempest Media

The Tempest is a digital publishing platform that has published hundreds of articles over the past 5 years.

Wordnik - Hyphenation Project

Wordnik provides a hyphenation API, with data licensed from traditional dictionaries.

Voice of Specially Abled People - Business Case for Investment into MMR vaccination by Developing Nations

The MMR vaccine helps reduce birth defects. The polio vaccine offers a parallel: most of the world is now polio-free because of serious global efforts on polio vaccination.

NLP for Cannabis Text Data

For this project, research apprentices will use Python to write code for scraping, munging, and classifying product data to better understand the dynamics of the United States cannabis industry.

Wordnik - Etymology Search

In this project, we hope to build an etymology search tool and API for Wordnik users. We'll digitize an out-of-copyright etymological dictionary and pull data from Wiktionary, and create an appropriate datastore.

Group Dynamics on Reddit

I downloaded 10 years of Reddit data (metadata and content). I am looking to clean the data and run some statistical models to examine what predicts group commitment (i.e., commitment to a subreddit).

Berkeley School of Law - Empirical Examination of Corporate Rebranding and Trademarks

Companies frequently register new trademarks or modify existing ones as part of their rebranding efforts. Several reasons, such as a change in the company's management team, may explain why rebranding decisions are made.

UCSF - Clinical Natural Language Understanding using transformer models and extensions incorporating tabular data

Since its advent, the BERT language model has achieved state-of-the-art performance on a variety of Natural Language Understanding (NLU) tasks such as question-answering.

UCSF - Clinical text classification/information extraction to understand real-world treatment effects at a large, academic medical center

Every time you visit the doctor and watch her document your complete medical history, your data is being captured by huge electronic health records (EHR) systems on the backend.

Innovations for Youth - Exploring spaces and places of violence against young people experiencing homelessness

The study interviewed young people experiencing homelessness and accessed administrative data, and those data have been used to construct geospatial analyses of sites and pathways of violence.

Powerside - System Telemetry Analysis

Powerside wishes to develop value condition monitoring algorithms for high-value systems with generally high consequential costs of failure: industrial, medical, transportation, government, and telecoms, for example.

Berkeley School of Law - What's Important to the Supreme Court

The Supreme Court of the United States has essentially unconstrained discretion to set its own docket.

WITI@UC, CITRIS and COE - Visualizing Women in Tech Data at UC

As part of continuing research into the state of women in tech, WITI@UC would like to enlist students to visualize data from CalAnswers and UCOP to show percentages and changes over time in the parti