Fall 2020

Wordnik - Hyphenation Project

Wordnik provides a hyphenation API, with data licensed from traditional dictionaries. However, more than half of the unique words of English aren't in any dictionary (see http://www.sciencemag.org/content/331/6014/176). To provide hyphenation for unknown words, we'd like to implement the Liang algorithm (https://www.tug.org/docs/liang/) and combine it with a model (based on known dictionary hyphenations) that provides a confidence metric for the...
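
As a rough illustration of the pattern-matching step in the Liang algorithm, here is a minimal Python sketch. It is not Wordnik's implementation. The four patterns are a hypothetical toy set chosen for this example rather than entries from a real pattern file, and the `left_min`/`right_min` defaults are an assumption borrowed from common TeX settings for English. Digits in a pattern score the gap they sit in, the highest score at each gap wins, and an odd score permits a break.

```python
import re

# Hypothetical toy patterns for illustration only (not from a real pattern file).
TOY_PATTERNS = ["hy3ph", "hen5at", "1na", "he2n"]


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and per-gap scores."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0  # number of letters seen so far = index of the current gap
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)
        else:
            pos += 1
    return letters, scores


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at gaps whose best score is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    wrapped = "." + word.lower() + "."  # dots mark the word boundaries
    points = [0] * (len(wrapped) + 1)   # points[k] scores the gap before wrapped[k]
    for start in range(len(wrapped)):
        for end in range(start + 1, len(wrapped) + 1):
            scores = parsed.get(wrapped[start:end])
            if scores is not None:
                for offset, score in enumerate(scores):
                    points[start + offset] = max(points[start + offset], score)
    pieces, last = [], 0
    # A break after j letters of the word is the gap before wrapped[j + 1].
    for j in range(left_min, len(word) - right_min + 1):
        if points[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return pieces


print("-".join(hyphenate("hyphenation", TOY_PATTERNS)))  # hy-phen-ation
```

In this toy set, `1na` alone would allow a break before "nation", but the higher even score from `he2n` suppresses it. That competition between patterns is the core of the approach.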

Using AI and real-time data to power an early warning system for public safety and the recognition of emotions

We will build an AI system to detect negative emotion from vocalizations recorded by body cameras worn by police (building upon several published papers in my lab) and develop a warning system that informs police officers when their emotions are escalating toward risky or aggressive behavior.

View our work here.

Berkeley Police Department - Simulating Alternative Responses to Calls for Service

The Berkeley City Council recently voted to audit the "calls for service" (CFS) received by the Berkeley Police Department (BPD) to determine the feasibility of transferring the response to certain types of calls to alternative emergency response agencies. Using historical CFS data provided by BPD, this project seeks to simulate the effects on response time, staffing levels, and racial disparities of alternative emergency response strategies.

The project will advance in complexity throughout the semester and perhaps through multiple semesters. We'll start with a cursory analysis of...

SF Chronicle - Disaster Maps

The San Francisco Chronicle has produced a well-known Fire Tracker for several years, using data from NOAA as well as NASA's MODIS and VIIRS-I satellites. We are also working on a Flood Tracker for the Houston Chronicle.

In both projects, we believe the quality of the real-time data being delivered to our readers could be enhanced by additional data sources and custom processing.

View our work here.

NLP for Cannabis Text Data

For this project, research apprentices will write Python code to scrape, munge, and classify product data to better understand the dynamics of the United States cannabis industry. Apprentices will apply their programming skills to (1) scrape product data from publicly available websites; (2) turn messy, unstructured data sets into clean data sets available for reproducible research; and (3) apply the latest techniques in natural language processing to find trends and patterns in product description data. These data science techniques will help us uncover the political and...

Lawrence Berkeley National Lab - Modeling Access Pattern of Large Scientific Data Repository with Machine Learning

Large amounts of data are archived on the tape archive (HPSS), especially for experiments that produce large volumes of data every year. The dCache system is a storage management system with HPSS connectivity that transfers data from HPSS to designated storage disks when users request it. The issue is that every tape mount on HPSS adds overhead to a file retrieval, on top of the seek time for the requested file on the tape and the time to read the file. From the system point of view, reading multiple files out of the same tape can increase the lifetime...

Creative Commons - Linked Commons Graph Analysis

Creative Commons has collected graph data linking different web properties that use Creative Commons licenses. Nodes are domains (rather than, e.g., individual pages). For each node, we record the number of CC Licenses and their types found at that domain. The edges (as one might imagine) are links between two domains, and are weighted by how many links there are between the two domains. We currently have a graph with about 250,000 nodes.
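
One way to picture the graph described above is as a weighted edge list. The Python sketch below makes several assumptions: the domain names and link counts are made up, the edges are treated as directed (source links to target), and weighted in-degree is used only as a simple baseline summary, not as the project's definition of influence.

```python
from collections import defaultdict

# Hypothetical edge list: (source domain, target domain, number of links).
edges = [
    ("blog.example", "wiki.example", 12),
    ("photos.example", "wiki.example", 3),
    ("wiki.example", "photos.example", 5),
]


def weighted_in_degree(edge_list):
    """Sum the weights of incoming links for every domain in the edge list."""
    totals = defaultdict(int)
    for source, target, weight in edge_list:
        totals[source] += 0        # ensure domains with no incoming links appear
        totals[target] += weight
    return dict(totals)


scores = weighted_in_degree(edges)
for domain, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(domain, score)
```

A measure such as PageRank would also account for who is doing the linking, at the cost of more computation on a graph of about 250,000 nodes.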

- How can we quantify the ‘influence’ of a node within the ‘Commons’, i.e., the set of domains hosting CC Licensed content? It would be...

East Bay Community Energy - Evaluating Alameda County CO2 Emissions and Optimizing Customer Programs Using Marginal Emissions Data

East Bay Community Energy’s goal is to provide cleaner and cheaper energy to East Bay cities and customers. This project has three aims: (1) compare CO2 accounting methods for EBCE electricity load, contrasting a locational/market-based approach with a marginal approach; (2) optimize electric vehicle (EV) and demand response (DR) schedules to minimize emissions and energy costs; and (3) evaluate whether more accurate energy demand forecasts lower CO2 emissions.
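
To make the first two aims concrete, here is a small Python sketch under loud assumptions: all numbers are invented, a flat average emission factor stands in for the locational/market-based approach, and the EV scheduler is a simple greedy rule (charge in the hours with the lowest marginal factor), not EBCE's method.

```python
# Hypothetical hourly data for four hours (values are made up).
load_mwh = [10, 12, 8, 9]
average_factor = [250, 250, 250, 250]   # kg CO2 per MWh, flat average (assumed)
marginal_factor = [400, 300, 100, 450]  # kg CO2 per MWh of the marginal generator


def total_emissions(load, factors):
    """Emissions are the hour-by-hour product of load and emission factor."""
    return sum(l * f for l, f in zip(load, factors))


def greedy_ev_schedule(factors, hours_needed):
    """Pick the `hours_needed` hours with the lowest marginal emission factor."""
    cleanest_first = sorted(range(len(factors)), key=lambda hour: factors[hour])
    return sorted(cleanest_first[:hours_needed])


average_total = total_emissions(load_mwh, average_factor)
marginal_total = total_emissions(load_mwh, marginal_factor)
charging_hours = greedy_ev_schedule(marginal_factor, hours_needed=2)

print(average_total, marginal_total, charging_hours)
```

The two accounting totals differ because the marginal factors vary by hour, which is also what gives a scheduler room to shift EV charging toward cleaner hours.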

View project submission...

Girl Effect - Speaking her Language: Using NLU to build conversational products for girls in developing countries

Girl Effect builds products for girls in developing markets. We work in Rwanda, India, Ethiopia, South Africa and Tanzania. Our product portfolio contains magazines, television shows, music videos, websites, chatbots and social media channels, all designed to drive uptake of knowledge and change in attitude around key subject areas defined in Girl Effect’s theory of change.

We try to make products which put our users at the front and centre of the experience. In our digital products, this means allowing girls to share their feedback and guide their own user journeys, rather than be...

WITI@UC, CITRIS and COE - Visualizing Women in Tech Data at UC

As part of continuing research into the state of women in tech, WITI@UC would like to enlist students to visualize data from CalAnswers and UCOP to show percentages and changes over time in the participation of women in tech fields, and specifically the participation and persistence of women on the Berkeley campus in various STEM majors. Research will look at intersectionality and potential opportunities for effective interventions to retain diverse talent in STEM majors. Participants will help define the research questions and will be FERPA...