July 26, 2018

The Data Science Discovery Program connects undergraduates with research opportunities that enhance their development as modern scholars. Undergraduates have the unique opportunity to build skills in software development, data analysis, and team management in an interdisciplinary way that demonstrates how Data Science can be applied in the real world.

By creating a platform for budding researchers to collaborate with a diverse cohort of institutions, startups, nonprofits, academic departments, and community impact initiatives, we bridge the divides between data, knowledge, and action. Participants find themselves deeply integrated in a multidisciplinary learning community of Berkeley’s graduate students, postdoctoral scholars, and staff immersed in the emerging field of Data Science. Discovery Program projects range from combating environmental corruption in Mexico, to analyzing transcription factor binding sites in fruit fly genomes, to studying the air we breathe in the Bay Area, to investigating potential disparities in hiring practices at UC Berkeley. The Spring 2018 Discovery Program engaged more than 23 partners and 85 undergraduate researchers in 30 projects. Project highlights follow.

If you are an undergraduate interested in applying for the program, or a graduate student, postdoc, or faculty member interested in proposing a project for undergraduate engagement, please visit the program website: https://data.berkeley.edu/education/discovery.

Highlights from Spring 2018 Discovery

Water Quality Analysis

Researcher: Sean Furuta

Affiliated Institute: SimpleWater

Our research identified informal and formal conversations about water quality across the web, including social media posts, broadcast transcripts, and news articles. This involved scraping text from social media and news sites. After collecting the data, we manually tagged each text with a binary label. These labels revealed keywords and phrases that could then be used to categorize articles that predict water crises. This work will allow SimpleWater to intelligently deploy services and products to ameliorate the effects of water pollution.
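The step from hand-labeled text to telling keywords can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual pipeline: the sample posts, the function names, and the smoothed log-odds scoring are all assumptions chosen for clarity.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labeled examples: 1 = about a water-quality problem, 0 = not.
SAMPLE_POSTS = [
    ("Lead found in tap water", 1),
    ("Boil water notice issued after contamination", 1),
    ("City council approves new park", 0),
    ("Local team wins game", 0),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_scores(labeled_docs, smoothing=1.0):
    """Score each word by its smoothed log-odds of appearing in label-1 text."""
    pos, neg = Counter(), Counter()
    for text, label in labeled_docs:
        (pos if label else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        word: math.log((pos[word] + smoothing) / pos_total)
        - math.log((neg[word] + smoothing) / neg_total)
        for word in vocab
    }


def top_keywords(labeled_docs, n=5):
    """Return the n words most associated with label-1 text."""
    scores = keyword_scores(labeled_docs)
    return sorted(scores, key=lambda word: (-scores[word], word))[:n]
```

Calling `top_keywords(SAMPLE_POSTS, 1)` on the toy data ranks "water" first, since it is the only word that recurs in the positively labeled posts. A real corpus would need stop-word handling and far more labeled text before the scores meant much.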

Equal Opportunity Employment Compliance

Researcher: Hannah Garriott

Affiliated Institute: UC Berkeley Central Human Resources

The research objective of Central Human Resources (HR) is to make more effective use of data to improve diversity in the non-academic staff population (more than 8,000 employees) with respect to women, ethnic minorities, veterans, and individuals with disabilities. Central HR produces annual datasets and reports to comply with federal affirmative action regulations. These data include demographic breakdowns of the current and past workforce, and details of personnel transactions (new entries, internal transfers and promotions, and separations).

This project aimed to identify trends, successes, and opportunities for improvement. Using statistical analysis, data visualization, and data processing, the desired outcome is to identify strategic areas in which to improve both UC Berkeley’s progress toward affirmative action goals and the campus’s general workforce diversity.

Bay Area Air Quality Analysis

Researcher: Ryan Lim

Affiliated Institute: California Institute for Energy and Environment

In collaboration with Google, researchers attached air quality sensors to Google Street View vehicles in the Oakland/Berkeley area to measure environmental and geographic data.

The goal is to determine relationships between environmental (air quality), geographic, and socioeconomic factors, with the actionable outcome being possible improvements in the public transportation infrastructure to better serve local communities and reduce air pollution.

Researchers employed Geographic Information System (GIS) technology, data visualization, and data analysis toolkits to study their data. Maps of AC Transit stops, BART stops, median household income levels, pollution levels, and high-density traffic areas were used to identify correlations between different factors that contribute to air pollution.

Cryptography of the Unknown Regions of Genomes

Researchers: Alex Nakagawa and Jemima Shi

Affiliated Institute: The Eisen Laboratory 

Our researchers attempted to match DNA sequences in more than 20 species of fruit flies to understand how evolution has influenced their respective genetic makeups. They examined how variations in sequence location and permutation affect phenotype expression, and used that understanding to compare enhancements that have developed within different species of fruit flies.

A neural network model was developed to predict how a specific sequence will interact with the rest of the genome in an actual specimen. Aligned DNA sequences were used in conjunction with motif probabilities to calculate the likelihood of a specific motif sequence. Researchers developed their own Python package and used d3.js to strategically visualize long sequences of DNA. 
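One common way to combine a sequence with motif probabilities is a position probability matrix, in which each motif position assigns a probability to each base. The sketch below assumes that approach. It is not the researchers' Python package, and the three-base matrix and function names are hypothetical values made up for illustration.

```python
import math

# Hypothetical position probability matrix for a 3-base motif.
# Each entry gives the probability of seeing a base at that motif position.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def log_likelihood(window, motif=MOTIF):
    """Log-probability that the motif model generates this exact window."""
    if len(window) != len(motif):
        raise ValueError("window and motif must have the same length")
    return sum(math.log(column[base]) for column, base in zip(motif, window))


def best_match(sequence, motif=MOTIF):
    """Slide the motif along the sequence; return (start, log-likelihood) of the best window."""
    width = len(motif)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    score, start = max(
        (log_likelihood(sequence[i:i + width], motif), i)
        for i in range(len(sequence) - width + 1)
    )
    return start, score
```

For the sequence `"TTAGCTT"`, `best_match` picks the window starting at index 2 (`AGC`), whose likelihood is 0.7 × 0.7 × 0.7 = 0.343. Working in log space keeps the product from underflowing on longer motifs.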

Housing Development and Protection Analysis

Researchers: Ayesha Yusuf and Brian Truong

Affiliated Institute: UC Berkeley Center for Community Innovation 

Researchers built a dataset of rental housing units in California that are covered by rent control and “just cause for eviction” protections. They determined the number of tenants who are protected by housing regulations, along with an approximation of the percentage of units that are not covered by these policies and the demographic breakdown of these residents. They found that in Berkeley, out of 30,000 units, approximately 8,522 (28%) are covered under rent control and 23,088 are covered under “just cause for eviction” protections. In Oakland, out of 100,000 units, roughly 11,121 are covered under rent control and 31,326 are covered under “just cause for eviction” protections. Further analysis, including an extrapolation of their datasets, is still underway.
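The coverage shares implied by these counts are simple to compute. The snippet below uses only the figures quoted above. The unit totals appear to be rounded, so the resulting percentages should be read as approximate, and the dictionary layout and function name are illustrative assumptions.

```python
# Unit counts as quoted in the text; totals appear to be rounded figures.
UNITS = {
    "Berkeley": {"total": 30_000, "rent_control": 8_522, "just_cause": 23_088},
    "Oakland": {"total": 100_000, "rent_control": 11_121, "just_cause": 31_326},
}


def coverage_share(city, protection):
    """Percentage of a city's rental units covered by the given protection."""
    row = UNITS[city]
    return 100 * row[protection] / row["total"]


for city in UNITS:
    for protection in ("rent_control", "just_cause"):
        print(f"{city} {protection}: {coverage_share(city, protection):.1f}%")
```

The Berkeley rent control share works out to about 28%, matching the figure reported in the text.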

Environmental Justice Platform

Researchers: Gaurav Mulchandani and Jo Apellanes

Affiliated Institute: Renewable & Appropriate Energy Laboratory

Switch-Mexico is a community-based initiative in Mexico that allows users to submit environmental reports anonymously via SMS, Twitter and Google Forms. It maps data including water pollution, air pollution, mining, illegal logging, power plant stations, refineries and landfills to assist populations that are affected by environmental injustice.

Our researchers helped develop the platform while scraping data from incoming reports to ensure that the data is uniform and processed for analysis and mapping. Text analysis and natural language processing were used to computationally understand the incoming reports. Image analysis and neural networks were then used to attempt to visually confirm the fidelity of reports via images from Google Maps.

COSI Telescope

Researcher: Devyn Donahue

Affiliated Institute: Space Sciences Lab

For the Space Sciences Lab, our researchers applied machine learning tools to the data-analysis pipeline of the NASA-funded Compton telescope COSI, which detects gamma rays with double-sided strip detectors. Their research resolved the multiple-interaction ambiguity in the COSI telescope's double-sided strip detectors while taking into account charge sharing, charge loss, trigger thresholds, and dead strips. The greedy algorithm previously used to try to solve the strip pairing problem was not efficient.
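To make the strip pairing problem concrete: when several gamma-ray interactions occur at once, each one leaves a signal on one strip on each side of the detector, and the pipeline must decide which strips belong together. The sketch below shows a generic greedy pairing that matches strips by closest energy. It is not COSI's actual algorithm. It assumes an idealized detector with none of the charge sharing, charge loss, thresholds, or dead strips mentioned above, and the function name and input format are hypothetical.

```python
def greedy_strip_pairing(x_hits, y_hits):
    """Pair strips from the two detector sides by closest measured energy.

    x_hits and y_hits map a strip ID to the energy (keV) read out on that
    strip. Returns a list of (x_strip, y_strip) pairs.
    """
    # Rank every possible x/y combination by how well the energies agree.
    candidates = sorted(
        (abs(x_energy - y_energy), x_strip, y_strip)
        for x_strip, x_energy in x_hits.items()
        for y_strip, y_energy in y_hits.items()
    )
    used_x, used_y, pairs = set(), set(), []
    for _, x_strip, y_strip in candidates:
        # Greedy step: accept the best remaining match and never revisit it.
        if x_strip in used_x or y_strip in used_y:
            continue
        used_x.add(x_strip)
        used_y.add(y_strip)
        pairs.append((x_strip, y_strip))
    return pairs
```

With `x_hits = {10: 300.0, 22: 150.0}` and `y_hits = {5: 148.0, 17: 305.0}`, the function pairs strip 22 with strip 5 and strip 10 with strip 17. Because each choice is final, a greedy scheme like this can lock in a poor early match once real detector effects blur the energies.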

Our researchers used Python, MEGAlib, ROOT, and the machine learning toolkit TMVA to analyze the interactions. They also performed Monte Carlo simulations with Cosima, which enables the simulation of most of the measurement scenarios encountered by X-ray and gamma-ray detectors in space and on Earth, to generate a well-defined data set.