Spring 2020 Discovery Projects

The complete list of projects for Spring 2020.

Campus Research Projects (Berkeley, UCSF, LBNL, CITRIS)

Department of Chemistry - BEACO2N

Analysis of a dense network of urban measurements: 1) assessing strategies for converting raw voltages to concentrations; 2) collecting information on emissions from disparate sources and formatting it for comparison to observations; 3) developing simple and complex models for comparing predicted and observed concentrations; and 4) creating visualizations that make the data publicly accessible.

Goldman School of Public Policy - Environmental Justice Mapping Project

The Environmental Justice Mapping Project's goal is to create a hub for environmental justice data across the country. With guidance from environmental justice organizations, we aim to aggregate this data into a single indicator of vulnerability for every state, similar to CalEnviroScreen and the Washington Environmental Health Disparities Map. This will fill an existing gap in data accessibility and provide a valuable tool for advocates and policymakers. The final product will be an online map and datasets combining census, environmental pollutant, and health data for every state. Students with experience or interest in spatial data analysis, database management, Python, and web development are encouraged to apply.

Data-Enabled Donations (Data Collaborative Project)  

This project aims to find ways to optimize the supply and demand of physical donations that shelters receive following a disaster.

Automated Water Purification (Data Collaborative Project)

Creating an app to remotely monitor and control arsenic purification systems.

College of Letters and Science - Student Success Analytics Platform

Creating machine learning algorithms to predict student success using anonymized data.

Office of Undergraduate Research and Scholarship - Campus Discovery Opportunities Database

Help with the development of a campus database platform that will better showcase student experiential learning opportunities at Cal. Seeking students with database development and web design skills.

Making Cyberspace Inclusive: Developing Inclusive Online Speech Detector

Many studies have focused on building a detection algorithm for hate speech, but no one has developed a similar algorithm for inclusion and belonging. The gap matters because inclusive cyberspace is not equal to the absence of online hate speech. It means people exchange ideas online in an inclusive manner. This project aims to fill this gap by developing a replicable and scalable research methodology that helps identify and examine incidents of inclusive online speech.

Assessing Research Usability for Development Practitioners in Southern Africa

Policymakers, NGOs, and businesses are overwhelmed with information from research on climate change and natural resource management. Southern Africa is a particularly diverse region with respect to forecasted changes in aridity, water access, and risk of natural disasters. While there is a wealth of scientific research on the potential effects of climate change and adaptation measures, it is unclear how frequently decision-makers across southern Africa use this potentially vital information. There is a growing awareness among academics that research must be made more accessible, but little guidance from the consumers of research on what barriers currently exist, or what formats may improve usability. This project seeks to answer these questions through direct data collection by surveys, including a randomized controlled trial of development practitioners in southern Africa. As part of the trial, this project will also apply innovative geospatial and climate forecasting methods through the Google Earth Engine in a reproducible, accessible way.

ESPM - Addressing Structural Inequality in U.S. Agricultural Higher Education: An Assessment of Pedagogical Practices and Food Systems Coursework at Land-grant Institutions

In the last decade, many U.S. “land-grant” agricultural universities (including UC Berkeley) have turned a reflexive lens on the fact that their institutions marginalize certain community members. These schools have begun to ask how inequities within their food and agriculture education perpetuate food systems injustices in the U.S. more broadly. This research project critically evaluates these nascent institutional efforts by examining how agriculture schools use pedagogical practices to improve equity and inclusion outcomes. We will build Python code to web scrape online course catalog descriptions at all 112 federally-funded agriculture schools to assess their pedagogical methods and types of food systems topics taught. This is a team-oriented, learning-by-doing project designed to draw on the skills and expertise in data science and social justice.

D-Lab, Digital Humanities, Near Eastern Studies - Ancient World Citation Analysis (AWCA)

The goal of this research project has been to build a generalizable workflow which enables both “close” and “distant reading” and language modeling of a digitized corpus in a field of study. The case study is the field of Near Eastern Studies, and the models we have made thus far lend themselves to both qualitative and quantitative analysis, capable of describing the research landscape over time. While we’ve made good progress toward this goal over the last year with word embeddings and topic modeling, we now hope to include n-gram and deep learning methods, including BERT implementations, which will be incorporated into the resulting network models.

D-Lab, Digital Humanities, Near Eastern Studies - Sumerian Network Analysis

Cuneiform tablets from the ancient Sumerian city of Puzriš-Dagan (currently in Iraq) document deliveries and expenditures of a state center with close connections to the cult and to diplomacy. The texts are dated between approximately 2050 and 2000 BCE; the corpus, which is fully digitized, includes some 15,000 documents. By building a network of officials, recipients, and delivering agents we will get a clearer idea of the inner workings of this center. No knowledge of Sumerian is required.

Sociology - Computational Analysis of Social Science Research

Professor Heather Haveman and doctoral candidate Jaren Haber are analyzing about 70,000 research articles gathered from JSTOR, the leading online repository of journal articles for the social sciences. We are developing a flexible and reproducible method to review academic literature that takes advantage of massive online collections containing nearly all articles published in academic journals. The goal is to harness computers to review the entire corpus of published literature, by charting engagement with specific theories or topics over time and across subfields. Specifically, we are developing a method to construct and validate dictionaries, that is, lists of concepts (unigrams, bigrams, and trigrams) related to a specific theory or topic. We are looking for research apprentices to run models that predict hand-coded scores on a subsample of documents using n-grams and various word embedding models, and to visualize and convey the results.
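
As a rough illustration of the dictionary idea described above, the sketch below counts how often terms from a small dictionary of unigrams, bigrams, and trigrams appear in a piece of text. It is not the project's method: the function names, the dictionary terms, and the sample sentence are all hypothetical and chosen only for the example.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dictionary_hits(text, dictionary, max_n=3):
    """Count how often each dictionary term (1- to max_n-gram) occurs in text."""
    # Lowercase and keep alphabetic tokens only (a deliberately simple tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in dictionary:
                counts[gram] += 1
    return counts


# Hypothetical dictionary and sentence, for illustration only.
terms = {"institutional", "resource dependence", "new institutional theory"}
sample = "New institutional theory and resource dependence shape institutional fields."
hits = dictionary_hits(sample, terms)
print(dict(hits))
```

Counts like these could serve as simple features when predicting hand-coded scores, though a real pipeline would likely need stemming, stop-word handling, and validation against human coders.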

Sociology - Computational Analysis of Charter School Identities and Stratification

Do charter schools' marketing strategies attract parents of specific race and class backgrounds, contributing to educational segregation by race and class? To answer such questions, our team in May 2018 used our supercomputer allocation to crawl the websites of every U.S. charter school open today, yielding a snapshot of over 6,000 cases. This crawler is no longer available, creating the opportunity to develop a state-of-the-art, universal, reproducible framework for capturing text data from the web that builds on our previous success. Specifically, we aim this semester to capture new data as well as data from back in time by drawing on multiple approaches usable from Python (scrapy, selenium, and wget) and the Internet Archive. We are looking for research apprentices to develop this complex web-scraping pipeline (in Python) and/or build statistical models (in R) to predict Structural Topic Model loadings using school and district race and poverty and to visualize results.

Sociology & Haas School of Business - Pricing norms in cannabis markets

For this project students will use Python to write code for scraping, munging, and classifying product data to better understand the dynamics of the United States cannabis industry. Apprentices will apply their programming skills to 1) turn messy unstructured data sets into shiny clean data sets available for reproducible research and 2) apply the latest techniques in natural language processing to find trends and patterns in product description data. These data science techniques will help us uncover the political and cultural elements that affect market competition in the US cannabis industry.

Department of Linguistics - Exploratory cross-linguistic analysis of acoustic data

This project aims to better understand the relationship between acoustics and language, and apply novel machine learning techniques to extract patterns in acoustic data.

Energy Resources Group - Thermal Unit Characterization Based on Data

The project focuses on generating realistic data sets for renewable generation integration. We use publicly available time-series data from generation companies to characterize real resource availability. Classically the ratings of generators are obtained from manufacturer data sheets and federal forms. However, these sources do not properly represent the technical constraints that generators are subject to during the year. In this project we will use data collected from 400+ generators in Texas to develop a data set for studying renewable energy futures in the state while considering technical constraints. The project requires knowledge of least squares and the ability to learn piecewise least squares; knowledge of optimization and mixed-integer programming is a plus but not a requirement. The project will be developed in the programming language Julia.
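
For readers unfamiliar with piecewise least squares, here is a minimal sketch of the idea: fit one ordinary least-squares line to the points below a breakpoint and another to the points above it. The project itself will be developed in Julia; Python is used here only for illustration, and the function names, the fixed breakpoint, and the data are hypothetical assumptions rather than anything taken from the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_piecewise(xs, ys, breakpoint):
    """Fit separate least-squares lines below and above a given breakpoint."""
    low = [(x, y) for x, y in zip(xs, ys) if x <= breakpoint]
    high = [(x, y) for x, y in zip(xs, ys) if x > breakpoint]
    return fit_line(*zip(*low)), fit_line(*zip(*high))


# Hypothetical data: slope 2 up to x = 3, then slope 0.5 afterwards.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 3.0, 5.0, 7.0, 7.5, 8.0, 8.5, 9.0]
(low_a, low_b), (high_a, high_b) = fit_piecewise(xs, ys, breakpoint=3)
print(low_a, low_b, high_a, high_b)
```

This version assumes the breakpoint is known and does not force the two segments to meet at it. Choosing the breakpoint from the data, or enforcing continuity, is where optimization techniques would come in.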

Cal Alumni Association - Cal Alumni Association Analysis of Engaged Alumni 

The UC Berkeley Division of Data Sciences and the Cal Alumni Association (CAA) will partner to conduct data analysis with the goal of determining the correlations between the participation in various alumni engagement activities and the propensity to achieve CAA’s mission outcomes: 1) donate to CAA and the university; 2) advocate on behalf of the university; 3) volunteer for CAA and the university; and 4) participate/engage in additional CAA and university alumni activities.

Exploring the Achievement Gap in Berkeley Public Schools

Working with state and district data to examine achievement gaps in Berkeley Public Schools.

UC Berkeley School of Law - The Docket of the Supreme Court of the United States: How does the Supreme Court decide what cases to take?

The Supreme Court of the United States has essentially unconstrained discretion to set its own docket. For a case to earn review, four (out of nine) Justices simply have to decide that it is sufficiently "important." But what makes a case important enough to merit the Supreme Court's attention? This project analyzes the corpus of the Supreme Court's decisions, and, more specifically, its descriptions of its decision to grant review, in order to discern what counts as "important" when it comes to Supreme Court review.

UCSF - Text Classification on Clinical Notes

Looking to perform supervised learning on endoscopy reports in order to enable clinical research on the comparative effectiveness of drugs for Inflammatory Bowel Disease (Crohn's Disease, Ulcerative Colitis). A secondary goal would be to perform unsupervised learning on these notes and the accompanying biopsy reports to understand how textual features may predict who will respond to different treatments (e.g. precision medicine).

UCSF - Knowledge Representation and Discovery from a large Clinical Text Corpus

Early-stage project to use representation learning methods to extract new insights from a large and high-quality clinical corpus.

UCSF - Detecting heart disease with cardiac sensor data

Cardiovascular disease remains one of the leading causes of death worldwide. This project is based at the UCSF Medical Center and will use medical data from various sources to improve AI-enabled detection and prevention of cardiovascular disease. Data employed include electrocardiograms (heart electrical tracings), echocardiograms (heart ultrasounds), medical record and laboratory data (blood lab values), among others.

UCSF - Meta-analysis of individual participant data from clinical trials of Crohn's Disease

Thanks to medical research we now have many treatments for Crohn's disease; unfortunately, we don't know which ones are better than others (comparative effectiveness) and in whom they're most likely to work (precision medicine). No one (least of all drug companies) wants to pay for a randomized trial pitting drugs against each other. So let's use the raw data from all the trials to answer the question. This has never been done before in the gastroenterology/inflammatory bowel disease field.

LBNL - Geometric and Manifold Learning on Graphs and Unstructured Data

While feature vectors and pixelated images are being used as standard proxies for presenting data in machine learning, many objects encountered in a scientific machine learning context can only be suitably modeled as discrete objects containing non-sequential structures and categorical attributes. As an example, a molecule is often visualized as a collection of atoms and bonds that only fits into a graph-based data structure.

To apply machine learning techniques to unstructured datasets, the project will make use of and further develop GraphDot (https://pypi.org/project/graphdot/), a Python package for bridging graph-based databases to a wide array of kernel-based machine learning methods. The project will be expanded on multiple fronts, including ML algorithm design, software implementation, and optimization. Applications to real-world scientific problems will be carried out to predict properties of molecules and crystal structures that are of importance in energy-related and pharmaceutical contexts.

LBNL - Streaming data analysis

We are studying prediction techniques for streaming data, with a couple of examples from financial applications.

Women in Tech Initiative (COE/CITRIS) - Innovation Resources Database

Collaborative database of resources for innovation and entrepreneurship - bit.ly/in-resources

WITI@UC (CITRIS/COE) - Data visualization of State of Women in Tech

As part of understanding the state of women in tech, WITI@UC would like to visualize data from CalAnswers and UCOP to show percentages and changes over time in the participation of women in tech fields, and specifically the participation and persistence of women on the Berkeley campus in various STEM majors.

Non-profit and Government Projects

NASA Ames Research Center - NASA Data Visualization

This project investigates the flight behavior of airline pilots and the factors affecting that behavior in a series of challenging simulated flights.

Needs: We need software that can play back the state of the simulator so events can be viewed for coding. We also need to display the coordinates specifying eye fixations on this display. Some additional functionality is also needed. This is a continuing project with development in Python.

Background: Our study produces a large, heterogeneous data set including a variety of types of written records, a large corpus of simulator log files, and associated eye-tracking data. We want to compare pilot activity on different flights, on a variety of measures, such as the time between two events (e.g. an initiating challenge and following response) and the frequency of particular patterns of behavior. The overall project will require identifying, coding, and analyzing the occurrence and timing of behaviors and linking behaviors across different data types (captured by the simulator, by the eye tracker, or by manual annotations).
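
One of the measures mentioned above, the time between an initiating challenge and the following response, can be sketched in a few lines. This is only an illustration under assumed inputs: the event labels, the `(timestamp, label)` log format, and the function name are hypothetical, and the real simulator logs will almost certainly look different.

```python
def response_latencies(events, start_type, end_type):
    """Pair each start event with the next end event; return elapsed seconds."""
    latencies = []
    pending = None  # timestamp of the earliest start event still awaiting a response
    for timestamp, event_type in sorted(events):
        if event_type == start_type and pending is None:
            pending = timestamp
        elif event_type == end_type and pending is not None:
            latencies.append(timestamp - pending)
            pending = None
    return latencies


# Hypothetical log: (seconds since start of flight, event label).
log = [
    (12.0, "challenge"),
    (15.5, "response"),
    (40.0, "challenge"),
    (41.0, "fixation"),
    (47.25, "response"),
]
latencies = response_latencies(log, "challenge", "response")
print(latencies)  # [3.5, 7.25]
```

The sketch ignores a second challenge that arrives before the first is answered. Whether that is the right rule depends on how the behaviors are coded.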


Analysis and projection of trends in work patterns of seniors

Climate Policy Initiative - D3 Data Viz using climate finance data

Working with Climate Policy Initiative's dataset of global climate finance data, make interactive graphics with D3 to host on the Climate Policy Initiative website.

Innovations for Youth - Examining the spaces of violence against youth experiencing homelessness

The SF-YEAH project is a multiphase study in the School of Public Health exploring the ways young people experiencing homelessness in San Francisco experience violence and find safety and resources. The project aims to identify the places, spaces, processes, and structures of safety and violence for young people in order to inform policies and best practices for serving vulnerable youth. The study data is in multiple formats and conditions, including qualitative, geospatial, administrative, and visual data, requiring the organization, curation, and conversion of disparate data into formats that are analyzable using spatial methodologies. The data conversion aspect of the project will utilize collaborative decision-making and problem-solving to develop approaches to consolidating data for effective analysis. Students will be actively involved in the design and implementation of the data curation plan, working with diverse data types toward the creation of accessible databases that will inform policymakers, researchers, and social service providers toward violence prevention efforts for young people experiencing homelessness.

Drexel University - Community Air Monitoring Monthly Report

Real-time air monitors in communities next to the Bay Area's five oil refineries measure various pollutants in the air once every minute. This project creates metrics to help communities understand what the minute-by-minute data says about air quality in the medium and long term.
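
To make the idea of summarizing minute-by-minute data concrete, the sketch below averages minute-level readings into hourly means and counts the hours above a threshold. It is an assumption-laden example, not the project's metrics: the function names, the `(minute_index, value)` input format, the readings, and the threshold of 20.0 are all hypothetical, and the threshold is not a regulatory limit.

```python
from collections import defaultdict
from statistics import mean


def hourly_means(readings):
    """Average (minute_index, value) readings into hourly means keyed by hour."""
    buckets = defaultdict(list)
    for minute, value in readings:
        buckets[minute // 60].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def hours_above(hourly, threshold):
    """Count hours whose mean value exceeds a threshold."""
    return sum(1 for value in hourly.values() if value > threshold)


# Hypothetical readings: minute index since midnight and a pollutant value.
readings = [(m, 10.0) for m in range(60)] + [(m, 30.0) for m in range(60, 120)]
hourly = hourly_means(readings)
print(hourly)
print(hours_above(hourly, threshold=20.0))
```

The same bucketing approach extends to daily or monthly summaries by changing the divisor, though real monitor data would also need handling for gaps and invalid readings.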

Political Research Associates - Right Wing Sheriffs

Much has been made of the far right's strategy of taking over state and local government, but one area that is currently under-scrutinized is the presence of so-called “Constitutional Sheriffs” on ballots and in office across the country. These sheriffs resist any form of gun control, believe that they have the power to essentially nullify federal law by not enforcing laws they do not agree with, and believe that the county sheriff is the highest official in the country. They are not regionally contained. The sheriffs began making headlines during the Obama presidency, reacting to any perceived threat to guns. Throughout his campaign and now into his presidency, Trump has placed a strong emphasis on “law and order,” empowering and emboldening these sheriffs. The Constitutional Sheriffs and Peace Officers Association (CSPOA) is the largest network of these individuals. We want to analyze data on Constitutional sheriffs against other trends including immigration enforcement, religion, class, and race to deepen public awareness of these important elected law enforcement officers.

Public Health Institute - Pathologists Cancer Alert

The idea for this project is supported by population-based cancer registry data on bladder cancer incidence. The Cancer Registry of Greater California (CRGC) is looking for support to implement the project in a sustainable way that offers a permanent, well-integrated communication solution in support of bladder cancer patients and the information their physicians receive. This project will use natural language processing on electronic pathology records to identify reports that are missing critical diagnostic information. Once those reports are identified, a communicator tool will be developed to improve the quality of information available to physicians treating cancer patients, with the aim of saving hundreds of lives among California patients, who are disproportionately of African American, South Asian, Japanese, and non-Hispanic White backgrounds and live in the Bay Area. This project delivers a previously unimaginable benefit to cancer patients by applying natural language processing to evaluate tumor tissue excised with and without muscle tissue, supporting patient care, and using AI to compare actual pathology reports with industry standards. The project can immediately achieve a reduction in cancer mortality. We can apply the technology to the incoming feed of electronic pathology reports, and use the results to convey a signal about the completeness of the pathology on which treatment and care depend. The technology will support a real-time, efficient message which will save lives. Physicians will be able to use validated, actionable information from the state-mandated cancer registry with the same diligence they use to report individual cancer data to support surveillance, cancer control, and the reduction of the cancer burden.

East Bay Community Energy - Identifying Electricity Usage & Disconnection Patterns in Disadvantaged Communities

Using anonymized electricity usage data and disconnection data (for non-payment), research and identify patterns and propensity scores through machine learning applications.

East Bay Community Energy - Forecasting Day Ahead Electric Load using a Machine Learning Application and Bottom-up Aggregation of Individual Meters

Using hourly data for a large set of individual meters, develop a machine learning algorithm to predict the day-ahead aggregated load. Relevant characteristics of each meter will be provided, along with some exogenous variables.
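
The bottom-up structure named in the project title (forecast each meter, then add the forecasts together) can be sketched as follows. The per-meter model here is a naive persistence baseline that repeats the most recent day, standing in for whatever machine learning model the project develops. The function names, meter IDs, and numbers are hypothetical, and a three-hour "day" is used only to keep the example short.

```python
def persistence_forecast(history, hours_per_day=24):
    """Forecast the next day as a copy of the most recent day (naive baseline)."""
    return list(history[-hours_per_day:])


def bottom_up_forecast(meter_histories, hours_per_day=24):
    """Forecast each meter separately, then sum the forecasts hour by hour."""
    per_meter = [
        persistence_forecast(history, hours_per_day)
        for history in meter_histories.values()
    ]
    return [sum(hour_values) for hour_values in zip(*per_meter)]


# Hypothetical hourly loads for two meters over two three-hour "days".
meters = {
    "meter_a": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    "meter_b": [0.5, 0.5, 0.5, 1.0, 1.0, 1.0],
}
forecast = bottom_up_forecast(meters, hours_per_day=3)
print(forecast)  # [3.0, 4.0, 5.0]
```

A baseline like this is mainly useful as a reference point: a learned model that uses meter characteristics and exogenous variables should be expected to beat it.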


New Sun Road - Anomaly Detection in Solar Microgrids

New Sun Road's anomaly detection project is developing machine learning tools to highlight unusual electrical events at solar microgrids, with applications in predictive maintenance, theft detection, and beyond.

Startup Projects

Data Visualization for User Experience

We seek to apply data visualization techniques to help analyze large document collections processed with topic modeling methodologies. The work involves abstracting tens of millions of records from qualitative text sources and comparing their categorization in terms of prevalence, sentiment, and coherence. This project aims to help researchers obtain insights from previously unintelligible corpora of text. This project will draw upon elements of natural language processing, sentiment analysis, and data visualization in real-world industry applications.

Catalyst Off-Grid Advisors - Data visualization and analytics for off-grid solar in Africa

The team will deliver on two main workstreams. The first will be to develop a “dummy” dataset that captures a simulated set of consumer receivable payments from thousands of customers. Various assumptions and inputs in the dataset will be flexed to yield multiple simulations. This dataset will then be used at a later date to develop portfolio-level analytic tools. The second will be to leverage open-source Facebook AI tools that predict road and electricity infrastructure as well as population density. These tools will be applied by the team to a set of African countries, and the results mapped using OpenStreetMap or a similar platform.

Clarity Movement Co. - Automated reports for city-wide air sensor network

At Clarity, we have real-time air quality data from our sensors deployed in cities around the world. We'd like to work with a student to start exploring an incoming dataset and start analyses to understand how exposure to air pollution can vary across a city.

Kiwibot - Kiwibot delivery robots

Estimating the delivery time that satisfies clients.

Kiwi Campus - Dispatcher algorithm

We are developing a dispatching algorithm that optimizes the assignment of Kiwi Mates and robots to the orders that we receive through our marketplace.

SimpleWater Inc - Environmental Health Estimator

While genomics data has garnered enormous attention, the vital and indeed dominant role of environmental exposure (exposomics) has been gravely overlooked. Join our fast-growing team to predict environmental health risks at every address in the United States.

Data for Social Good Foundation

Data for Social Good provides easy-to-use tools for non-profits and political campaigns that want to organize during and in between campaigns. We have created a data system designed to support relational organizing in communities of color. It is longitudinal, customizable, and able to incorporate non-voter data. Our tools provide 16 GIS layers of data and make them easy to use for the common practitioner. We are at the beta stage with a few customers who will test our database and apps within the state of California.