Data Science Discovery Projects
Connecting students with cutting-edge data science research opportunities
Data Science Discovery Projects, a joint effort of the Berkeley Institute for Data Science, the Division of Data Sciences, and the Undergraduate Research Apprenticeship Program (URAP), gives undergraduates more opportunities to engage in hands-on, team-based discovery work leveraging data science by connecting them with cutting-edge graduate student and postdoc research projects, community impact groups, entrepreneurship ventures, and educational initiatives across UC Berkeley. Data science is an intrinsically interdisciplinary field with broad reach, fast scaling capacity, and a large pool of interested students and projects. Students can earn units through URAP.
For questions, please contact the Program Manager, Anthony Suen, at anthonysuen@berkeley.edu.
SPRING 2018 PROJECTS
Project Title | Research Lead | Affiliated Institute | Description |
Mapping Alzheimer Disease with Computer Vision | Maryana Alegro, Jessica Kudlacek | UCSF Grinberg Lab | This project applies neuropathology to Alzheimer's disease research. Although we are a neuroscience lab, our current project is focused on computer vision. We are mapping TAU (a protein associated with Alzheimer's disease) across the whole human brain and using these maps to validate PET scan tracers. Imaging TAU in a living human brain would be a huge breakthrough in AD diagnosis, and many pharmaceutical companies are working on it; a major problem is that these tracers are usually very unspecific and bind to other proteins besides TAU. To do so, we have built our own brightfield microscope scanner. We are scanning entire human brain tissue, which yields several gigabytes of data. TAU segmentation is performed using deep learning. |
Travel Patterns of Residents at Affordable Transit-Oriented Developments | Miriam Zuk | Center for Community Innovation | As part of our efforts to promote sustainable and equitable growth, the Center for Community Innovation is currently studying the environmental and other co-benefits of locating affordable housing near transit. This project aims to fill current gaps in the literature by examining the relationships between affordable housing, proximity to transit, and travel patterns through primary data collection and analysis. Sponsored by the California Air Resources Board, this study will help to inform the future of state and local policy pertaining to affordable transit-oriented developments. Primary GPS and survey data, which are being collected by a team of graduate and undergraduate researchers, will be analyzed to determine the travel patterns (e.g., distance, duration, mode) of residents of affordable housing developments both near and far from rail transit stations. Students will analyze GPS data collected with the e-missions app. |
Sumerian: Identifying Words and Meanings | Adam Anderson | Digital Humanities | Sumerian is the oldest written language in the world, written in cuneiform. Tens of thousands of documents have been digitized in transliteration (in the Roman alphabet) in the online database BDTNS. These documents deal with taxes, agriculture, trade, and other matters. A database of Sumerian words is available in the electronic Pennsylvania Sumerian Dictionary (ePSD). Connecting these two databases is surprisingly difficult: the current method of identifying words and linking them to entries in the dictionary (lemmatizing) is very time-consuming. |
Multiple Majors and Major Migration Visualization Tool | Sara Quigley | Office of Planning and Analysis | Using the D3 JavaScript library and data from Cal Answers, OPA analysts have created several interactive visualizations showing double and triple majors, as well as migration among different major programs. |
Threestones | Diego Ponce de Leon Barido | CITRIS | The startup is building Latin America’s first digital energy marketplace, both fostering transparency in energy governance and creating a valuable product from the large amount of data we’ve been collecting in Latin America. Students need significant experience with Python and serious web scraping skills. |
Text Analysis Townhall Meetings | Alexander Sahn | Political Science | The town-hall meeting where citizens and politicians interact face-to-face is often viewed as the idyllic form of participatory democracy. While rare in national politics, this form of participation is widespread in cities across the country. Yet, the content and consequences of these interactions remain unexamined. This project will investigate the inequalities in representation that arise from politicians using meeting participation to gather information about their constituents’ views. Policymakers and practitioners often point to public meetings as the primary obstacle to housing development in California cities, since a vocal minority can sway elected officials. We will examine records of city council and zoning board meetings in California cities to determine who participates, why they do so, and whether their actions affect political outcomes such as the approval of housing developments. |
MRI Brain Imaging | Maryam Vareth | UCSF Surbeck Laboratory for Advanced Imaging | Seeking students with programming skills (mostly Python) and some experience with convolutional neural networks (CNNs) and the TensorFlow platform; experience with imaging and medical images is a plus. Students with less experience are also welcome, as long as they are interested in starting with “cleaning data” tasks while gaining the experience needed to join the group for more programming-heavy tasks. |
Effects of State and Local Minimum Wage Policies on Health | Sylvia Allegretto | Center on Wage and Employment Dynamics (CWED) | CWED seeks to understand the effects of state and local minimum wage policies on health and other outcomes. Research would involve processing large microdata sets (reviewing codebooks, importing raw data, and cleaning and reshaping it for analysis); computing, tabulating, and graphing descriptive statistics; and running simple regression models. |
Low-cost Monitoring to Improve Grid Reliability | Veronica Jacome | Energy Resources Group | Frequent electrical power failures drastically reduce quality of life and weaken the economic opportunities enabled through access to the electric grid. High-resolution measurements of the frequency and duration of power outages are critical to understanding and improving grid reliability. These measurements are also necessary to study deeper economic and socio-economic questions about how unreliable electricity impacts economic development and growth. However, due to the lack of appropriate monitoring technologies, it is not economically feasible for many utility companies in the developing world to gather this data. This project builds on a long partnership with the Zanzibar Electricity Corporation (ZECO), the sole electricity distribution utility in Unguja, Tanzania, to deploy the second generation of our novel, low-cost power grid monitoring system. This system monitors the location, duration, and scope of problems at the low-voltage distribution level of the grid in near real-time. Over the course of our proposed pilot, we aim to both measure and improve the reliability and accuracy of our system. Additionally, we plan to leverage our connections with ZECO to co-develop an interface that allows data from our pilot to directly help inform utility grid maintenance. |
Reducing Human Radio Frequency Interference | Steve Croft | Berkeley SETI Research Center | Breakthrough Listen is the world's leading search for extraterrestrial intelligence, using some of the world's most powerful telescopes to survey planetary systems around nearby stars for signs of technology. Students will develop code to extract features from large arrays of image-like data, and visualize and explore patterns in the data using approaches such as PCA and clustering. They will apply unsupervised and supervised approaches to reject human radio frequency interference and home in on signals of interest, and develop interfaces that allow volunteers to label data. |
Inten.to | Konstantin Savenkov | Skydeck | Currently, for each AI vertical there are a dozen or so cloud service vendors with different APIs and no clear performance comparison. We provide a single interface to all of them and dispatch requests to the best vendor based on the data and client needs. This way, our clients get the optimal solution for their AI tasks with just a single integration and contract. We have this running for Machine Translation and will soon launch Sentiment Analysis. The role will involve adding new AI verticals to our platform, such as Entity Recognition, Text Summarization, OCR, and ASR. |
Evaluating SmallSat Performance | Andreas Zoglauer | Space Science Labs | This project evaluates the performance of a concept for a SmallSat astrophysics space mission to map the inner Milky Way in low-energy gamma rays and to observe the polarization from pulsars, AGN, etc. One key open question is how well we will be able to identify and suppress the gamma-ray background originating from internal radioactive activation and subsequent decays, as well as from the Earth’s albedo emission. Applying machine learning techniques to identify the background and distinguish it from source gamma rays will be the key topic of this project. |
MindsDB | Jorge Torres | Skydeck | MindsDB’s goal is to make it simple to apply deep learning to existing problems. We are doing this because evidence shows that in most cases deep learning methodologies outperform intuition-based models and traditional machine learning by a significant margin, yet there are many hurdles to implementing deep learning correctly. As a solution, MindsDB automates the steps needed to get high-quality predictions from most data. We essentially turn databases into an automated data science team. MindsDB is seeking students to build neural networks that can architect other neural networks that can best solve a problem, and to build a system that allows a neural network not only to solve a specific problem but also to explain why it is making the decisions that it makes. |
Simple Water | John Pujol | CITRIS | SimpleWater is a CITRIS startup that provides water quality analysis and treatment solutions for individuals and commercial customers. As part of our water research, we're building out the technical capability to monitor water news at web scale using crawler APIs, NLP, and big data techniques. SimpleWater is seeking student researchers to understand and catalog web-scale data from news sites, blogs, message boards, etc., with the intent of extracting conversations about water and water quality issues specifically. Students will also assist in creating a system for monitoring these in real time. Students should have knowledge of Python and statistics; Natural Language Processing experience is preferred. We are flexible on the level of involvement, but to make this worthwhile for us we'll need at least 12 student-hours per week of involvement. |
BioXplor | Mark Rogers | Skydeck | BioXplor is an augmented intelligence platform for data-driven hypothesis generation and novel predictions in biomarker and drug discovery. The virtual discovery platform is built at the intersection of big data, artificial intelligence, and high-performance computing, and works by enabling accelerated data interpretation from unstructured literature and structured multi-omics data. In-house and external projects are focused on AI-assisted biomarker discovery, molecular target identification, and drug repurposing, presently in cancer and rare disease. |
Increasing Staff Diversity | Tim Abdellah Fuson | Central HR | Central HR at UC Berkeley is trying to make more effective use of data to improve diversity in the non-academic staff population (over 8000 employees) with respect to women, ethnic minorities, veterans, and individuals with disabilities. We produce annual datasets and reports to comply with federal affirmative action regulations. These data include demographic breakdowns of the current and past workforce, and details of personnel transactions (new entries, internal transfers and promotions, and separations). The current project is a deeper dive into the data: identifying trends, successes, and opportunities for improvement. The desired outcome is to identify strategic areas in which to improve both our progress toward affirmative action goals and our general workforce diversity. |
Cryptography of the Unknown Regions of Genomes | Ciera Martinez | BIDS/MCB | We are approaching this problem from a computational perspective by mapping known characteristics of DNA and doing comparative analysis across ~25 different species of Drosophila (fruit fly). The first part of the project will be data wrangling, which consists of annotating and organizing the DNA data (long strings of letters). We will then create workflows to map features onto these sequences and combine existing datasets, with the end goal of feeding the data into machine learning algorithms to predict function in non-coding regions of DNA. |
Jupyter Discovery Engine | Joshua Quan | Library | The UC Berkeley Library, the Division of Data Sciences, and the Berkeley Institute for Data Science are looking for student developers to create JupyterHub extensions that facilitate the discoverability and reuse of datasets for research, course modules, and assignments, and that contribute to the reproducibility of newly published research. |
SkyAlert | Alejandro Cantu | Skydeck | SkyAlert provides early earthquake warning services to consumers and businesses in Mexico. SkyAlert operates the only private network of early earthquake warning sensors in America, covering over 280,000 sq km of earthquake-prone areas. When an earthquake occurs, we send an alert via mobile devices to millions of users in the areas that will be impacted. For business customers, we send the alert through hardware that we have designed and manufactured using IoT technology. We can warn our users 10-120 seconds before the impact, depending on the location of the epicenter and the distance traveled by the tremor. This allows users to evacuate and/or move to safety zones. The role will involve analyzing earthquake data coming in from the sensors to determine the best way to issue early earthquake warnings. Students should have an understanding of data science, math, basic seismology concepts, and computer science. |
Electric Systems Modeling in Multiple Scales for High Performance Computing | Jose Daniel Lara | Energy Resources Group | The project is to develop a modular multi-energy system simulation platform in collaboration with the National Renewable Energy Laboratory (NREL) and Los Alamos National Laboratory (LANL). In this work, a new framework to model and study the effects of renewable energy at multiple time scales and in different sectors will be developed. Current practices in renewable energy integration studies don't allow analysts to perform integrated assessments and require ad-hoc mixtures of software packages developed by different vendors, making the evaluation costly and in many cases incomplete. Moreover, current tools do not reliably integrate electric power systems modeling with other sectors that in recent years have acquired great relevance to advancing the integration of renewable energy, such as natural gas and water networks. The final objective is the development of a platform to model and analyze the interaction of renewable energy sources with other sectors, considering multiple timescales. The platform will be developed as an open-source tool available to operators and analysts to reduce the costs and technical challenges of performing a thorough assessment of renewable energy integration. The project will be coded in the Julia programming language with a focus on high-performance computing applications. The job responsibilities include developing code to build the models and producing the accompanying documentation for future development. |
Energy Systems Modeling of Mexico (SWITCH-Mexico) | Sergio Castellanos | Energy Resources Group | SWITCH-Mexico will test your research skills by having you (1) expand on the literature on decarbonization pathways and their methodologies, and (2) wrangle data (Python, SQL, GIS). Your work will consist first of summarizing some literature and then proposing some exciting scenarios (e.g., energy efficiency, EVs) that you will help build in our still-developing model. The results will potentially inform policy makers in Mexico, so you will experience first-hand the impact of your work. We are also looking to build out a solid visualization whose first steps have already been established in JavaScript, and we’d expect you to take it to the next level. Evaluating Air Quality from Big Data: This is a big data problem around the evaluation of air quality data from transportation, and one that will require a strong time commitment. Your tasks will involve: (a) literature review, (b) database scripting, (c) statistical analysis, (d) visualization, and (e) proposed future work. Qualifications: We are looking for highly motivated and passionate students proficient in Python, SQL, GIS, and/or PostGIS who have the time to immerse themselves in this project. |
Quantifying the Role of Partisanship in Rooftop Solar PV Adoption in the U.S. | John Dees | Energy Resources Group | Support for or opposition to renewable energy policies often takes on a partisan character in U.S. politics. However, solar PV and supporting state and federal policies benefit consumers across the political spectrum. In turn, the benefits of rooftop solar PV policies create interest groups apart from ideological leanings. This has important implications for the politics of solar policy going forward. Hence, this project seeks to quantify the role of political voting behavior in the adoption of solar PV. We are looking for a student to compile, clean, and merge geospatial (GIS) data for each state’s voting data, solar PV installation data, U.S. Census Bureau income
Liberating Textual Data: Opening up Public Archives for Open Research | Nick Adams | BIDS | Students will learn how to gather, parse, and publicly share digital archives that are currently inaccessible for research purposes. Using the Government Publishing Office’s Congressional Hearing Transcripts as an example, the project leads will guide participants through their own data liberation project, in which they will: scrape the web for document files while retaining document metadata; programmatically find and extract meaningful data objects within the documents; link those objects to external databases; prepare all this compiled textual data for computational analysis in R and Python; and host their newly formed database so that the public and other researchers can launch their own studies of the data. Participants will work together on group assignments built around a textual source of their own choosing, each culminating in a new, final database, a research paper, and its presentation. This project is sponsored by Berkeley's Social Science Matrix. |
DecidingForce: Discovering Patterns of Peace and Violence between Police and Protesters | Nick Adams | BIDS | Students on this project will use a collaborative web app to process the information in over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. Data processed by students will be used to find patterns of peace and violence, which can be used to scaffold broad public conversations and shift the behavior of police and protest strategists. Students' work will be used to create artificial intelligence able to understand dynamics between cities and movements and recommend policies more likely to result in peaceful and effective political expression. |
UC Berkeley Business Intelligence Project | Aswan | UC Berkeley Finance | Berkeley's current applications are not well suited to building ad-hoc queries that answer simple high-level questions like "What is the current employee headcount of the economics department?" This project will develop a BI tool in which ad-hoc queries can be built more easily to answer some of these critical questions. |
PublicEditor: The Citizen Science Solution to Media Misinformation | Nick Adams | BIDS | Students on this project will help develop and test a collaborative web app guiding thousands of internet volunteers to read through the most shared news articles and find evidence of misinformation in the content. Working with a national coalition of social science researchers and journalists, a Nobel Laureate, cognitive scientists, and software designers/developers, students will see firsthand how the social good technologies of the future get built. |