Data Science Discovery Projects
Data Science Discovery Projects, a joint effort of the Berkeley Institute for Data Science and the Division of Data Sciences, provides undergraduates with hands-on, team-based discovery opportunities leveraging data science by connecting them with cutting-edge graduate student and postdoc research projects, community impact groups, entrepreneurship ventures, and educational initiatives across UC Berkeley. Data science is an intrinsically interdisciplinary field with broad reach, fast scaling capacity, and a large pool of interested students and projects. Students can earn units through the Undergraduate Research Apprenticeship Program (URAP).
SPRING 2018 PROJECTS
Center on Wage and Employment Dynamics (CWED)
CWED is seeking students to support our research this semester on the effects of state and local minimum wage policies on health and other outcomes. Research will involve processing large micro datasets (reviewing codebooks, importing raw data, cleaning and reshaping for analysis), computing, tabulating, and graphing descriptive statistics, and running simple regression models. Applicants should be familiar with basic statistics and econometrics (e.g., means, variances, descriptive statistics, basic regression models) and have experience with Excel and Stata (R is acceptable if no one is available to work with Stata). Interest in economics (especially labor) or health issues is a plus.
Center for Community Innovation - Travel Patterns of Residents at Affordable Transit-Oriented Developments
As part of our efforts to promote sustainable and equitable growth, the Center for Community Innovation is currently studying the environmental and other co-benefits of locating affordable housing near transit. This project aims to fill current gaps in the literature by examining the relationships between affordable housing, proximity to transit, and travel patterns through primary data collection and analysis. Sponsored by the California Air Resources Board, this study will help to inform the future of state and local policy pertaining to affordable transit oriented developments. Primary GPS and survey data, which is being collected by a team of graduate and undergraduate researchers, will be analyzed to determine the travel patterns (e.g., distance, duration, mode, etc.) of residents of affordable housing developments both near and far from rail transit stations. Students will analyze GPS data collected with the e-missions app. URAPs will be tasked with conducting data exploration and cleaning using/adapting existing Jupyter notebooks. Work will also involve the creation of data visualizations and maps to further explore, clean and analyze the data. Applicants should be proficient in Python and GIS.
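As an illustration of the kind of trip-level summaries described above, the sketch below computes per-trip distance and duration from raw GPS fixes with pandas. The column names (`trip_id`, `timestamp`, `lat`, `lon`) are assumptions for illustration, not the e-missions app's actual schema:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between consecutive GPS fixes."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def trip_summaries(df):
    """Per-trip distance (km) and duration (minutes) from raw GPS fixes.
    Column names are hypothetical placeholders for the real schema."""
    df = df.sort_values(["trip_id", "timestamp"])
    rows = []
    for trip_id, g in df.groupby("trip_id"):
        # Sum segment distances between consecutive fixes.
        dist = haversine_km(g["lat"].values[:-1], g["lon"].values[:-1],
                            g["lat"].values[1:], g["lon"].values[1:]).sum()
        dur = (g["timestamp"].iloc[-1] - g["timestamp"].iloc[0]).total_seconds() / 60
        rows.append({"trip_id": trip_id, "distance_km": dist, "duration_min": dur})
    return pd.DataFrame(rows)
```

From summaries like these, mode and purpose inference would be layered on separately.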
Berkeley Initiative for Transparency in the Social Sciences/Economics - Academic Publication Transparency
In this project we investigate the role that transparency of data and code plays in academic publication, and what incentives exist for researchers to do better science. In academia, the stakes to publish successfully are high ("publish or perish"), and this has led to various negative side effects such as publication bias, p-hacking, and even fabrication of data. In 2010, as a step towards better scientific practice, the editor of one of the top journals, the American Journal of Political Science, began to require that all publications provide the data and code necessary to reproduce their claimed results. Other leading journals (e.g., the American Political Science Review) had no such requirement. We want to investigate how the prominence of a publication, as measured by citations, is affected by whether the authors publicly share the underlying data and code. There is already strong evidence that data sharing is positively correlated with citations, but we are interested in finding a causal effect. The change in policies at some journals and the lack thereof at others will hopefully allow us to identify a causal effect rather than just a correlation. We have already gathered the data for two political science journals, but we need to finish gathering the data from two economics journals (the American Economic Review and the Quarterly Journal of Economics) and do a considerable amount of data cleaning and analysis. For this project we are looking for students who are enthusiastic about research transparency, reproducibility, and making a difference in the academic publishing system.
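One way to formalize the identification strategy sketched above is a simple difference-in-differences comparison: contrast the change in mean citations at journals that adopted the sharing policy with the change at journals that did not. The minimal sketch below assumes hypothetical columns (`treated`, `post`, `citations`) that are not the project's actual schema:

```python
import pandas as pd

def did_estimate(df):
    """Difference-in-differences on mean citations:
    (treated post - treated pre) - (control post - control pre).
    `treated` = 1 for policy-adopting journals, `post` = 1 after 2010;
    both column names are illustrative assumptions."""
    m = df.groupby(["treated", "post"])["citations"].mean()
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])
```

A full analysis would use a regression with article-level controls, but the quantity above is the core comparison.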
Berkeley Initiative for Transparency in the Social Sciences/Economics - Specification Curve Analysis
Run a specification curve analysis of multiple published economics papers. (See paper here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2694998)
That means we'd have to select a set of economics papers with publicly available data and code, read the papers to determine all the points of analytical flexibility in the analysis, and then re-run the complete set of all possible combinations of choices. This involves a lot of detailed paper reading and regression analysis, primarily in Stata, with perhaps a small amount of R, since that is what the original authors likely used. Students will learn data collection, automation, and validation; they will be exposed to a variety of modern research papers in political science and economics and gain first-hand experience of their subject matter, level of reproducibility, and some of the statistical methods involved. The team will adapt existing programming scripts to clean data from additional journals, manually categorize article characteristics, automate cross-checking and validation of entered data, then visualize and analyze the data using regression analysis. We are seeking students who know Python extremely well and can handle this sort of data construction; once the manual data entry is complete, basic data analysis and visualization in R or Stata will also be needed.
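The core of a specification curve analysis, re-running an estimate under every combination of analytic choices, can be sketched in a few lines (shown here in Python for brevity, though the project itself would mostly run in Stata; the choice names and estimator are placeholders):

```python
from itertools import product

def specification_curve(data, choices, estimate):
    """Run `estimate(data, **spec)` for every combination of analytic choices.

    `choices` maps a choice name to its alternatives, e.g.
    {"log_outcome": [False, True], "drop_outliers": [False, True]};
    the names are illustrative, not from any particular paper.
    Returns all specs sorted by their estimate, ready to plot as a curve.
    """
    results = []
    for combo in product(*choices.values()):
        spec = dict(zip(choices.keys(), combo))
        results.append({**spec, "estimate": estimate(data, **spec)})
    return sorted(results, key=lambda r: r["estimate"])
```

Plotting the sorted estimates, with markers indicating which choices each spec made, yields the familiar specification curve figure.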
Space Sciences Laboratory (SSL)
We are currently working on evaluating the performance of a concept for a SmallSat astrophysics space mission to map the inner Milky Way in low-energy gamma rays and to observe the polarization from pulsars, AGN, etc. One key open question is how well we will be able to identify and suppress gamma-ray background originating from internal radioactive activation and subsequent decays, as well as from the Earth’s albedo emission. Applying machine learning techniques to identify this background and distinguish it from source gamma rays will be the key topic of this project.
Berkeley SETI Research Center
Breakthrough Listen is the world's leading search for extraterrestrial intelligence, using some of the world's most powerful telescopes to survey planetary systems around nearby stars for signs of technology.
- Develop code to extract features from large arrays of image-like data.
- Visualize and explore patterns in the data using approaches such as PCA and clustering.
- Apply unsupervised and supervised approaches to reject human radio frequency interference and home in on signals of interest.
- Develop interfaces allowing for labeling of data by volunteers.
Skills: Python, Jupyter, D3, C++, CUDA, SQL, Unix, VM and cloud experience, and data visualization
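As a sketch of the dimensionality-reduction step listed above, flattened image-like cutouts (one row per cutout) can be projected onto their top principal components with a plain SVD. This is a minimal illustration under assumed data shapes, not Breakthrough Listen's actual pipeline:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project flattened, image-like arrays onto their top principal
    components via an SVD of the centered data matrix.

    X has shape (n_cutouts, n_pixels); the shapes are assumptions for
    illustration. Returns scores of shape (n_cutouts, n_components),
    which can then be fed to a clustering algorithm.
    """
    Xc = X - X.mean(axis=0)                      # center each pixel/feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores in component space
```

Clustering the resulting low-dimensional scores is one simple way to separate recurring interference morphologies from rarer candidate signals.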
The startup is building Latin America’s first digital energy marketplace, fostering energy governance transparency and creating a valuable product from the large amount of data we’ve been collecting in Latin America. Students need significant experience with Python and serious web scraping skills.
MindsDB’s goal is to make it simple to apply deep learning to existing problems. Evidence shows that in most cases deep learning methodologies outperform intuition-based models and traditional machine learning by a significant margin, yet there are many hurdles to implementing deep learning correctly. As a solution, MindsDB automates the steps needed to get high-quality predictions from most data; we essentially turn databases into an automated data science team. MindsDB is seeking students to build neural networks that can architect other neural networks best suited to a problem, and to build a system that allows a neural network not only to solve a specific problem but to explain why it is making the decisions it makes. Students should have taken CS188 (https://www2.eecs.berkeley.edu/Courses/CS188/) or have relevant skills.
Currently, for each AI vertical there are a dozen cloud service vendors with different APIs and unclear performance comparisons. We provide a single interface to all of them and dispatch requests to the best vendor based on the data and client needs. This way, our clients get the optimal solution for their AI tasks with just a single integration and contract. We have this running for machine translation and will soon launch sentiment analysis. The role will involve adding new AI verticals to our platform, such as entity recognition, text summarization, OCR, and ASR.
SimpleWater is a CITRIS startup that provides water quality analysis and treatment solutions for individuals and commercial customers. As part of our water research, we're building out the technical capability to monitor water news at web scale using crawler APIs, NLP, and big data techniques. SimpleWater is seeking student researchers to understand and catalog web-scale data from news sites, blogs, message boards, etc., with the intent of extracting conversations specifically about water and water quality issues, and to assist in creating a system for monitoring these in real time. Students should have knowledge of Python and statistics; experience with natural language processing is preferred. We are flexible on the level of involvement, but to make this worthwhile we'll need at least 12 student-hours per week.
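A minimal sketch of the extraction step described above: keeping only scraped posts that mention water-quality terms. The term list is a tiny illustrative seed, not SimpleWater's actual lexicon, and a real system would layer NLP on top of this keyword pass:

```python
import re

# Illustrative seed terms only; a production lexicon would be far larger
# and likely learned or curated rather than hard-coded.
WATER_TERMS = re.compile(
    r"\b(water quality|contamination|lead|nitrate|drinking water)\b", re.I
)

def water_related(posts, min_hits=1):
    """Keep only posts mentioning water-quality terms at least `min_hits` times."""
    return [p for p in posts if len(WATER_TERMS.findall(p)) >= min_hits]
```

Filtered posts could then feed topic modeling or entity extraction to surface specific conversations and locations.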
BioXplor - Augmented Intelligence Platform for Life Science Discovery
BioXplor is an augmented intelligence platform for data-driven hypothesis generation and novel predictions in biomarker and drug discovery. The virtual discovery platform sits at the intersection of big data, artificial intelligence, and high-performance computing, and works by enabling accelerated data interpretation from unstructured literature and structured multi-omics data. In-house and external projects are focused on AI-assisted biomarker discovery, molecular target identification, and drug repurposing, presently in cancer and rare disease.
SkyAlert provides early earthquake warning services to consumers and businesses in Mexico. SkyAlert operates the only private network of early earthquake warning sensors in America, covering over 280,000 sq km of earthquake-prone areas. When an earthquake occurs, we send an alert via mobile devices to millions of users in the areas that will be impacted. For business customers, we send the alert through hardware equipment using IoT technology that we have designed and manufactured. We can warn our users 10-120 seconds before the impact, depending on the location of the epicenter and the distance traveled by the tremor, allowing users to evacuate and/or move to safety zones. The role will involve analyzing earthquake data coming in from the sensors to determine the best way to issue early earthquake warnings. Students should have an understanding of data science, math, basic seismology concepts, and computer science.
Energy Resources Group - Electric Systems Modeling in Multiple Scales for High Performance Computing
The project is to develop a modular multi-energy system simulation platform in collaboration with the National Renewable Energy Laboratory (NREL) and Los Alamos National Laboratory (LANL). In this work, a new framework to model and study the effects of renewable energy at multiple time scales and in different sectors will be developed. Current practice in renewable energy integration studies doesn't allow analysts to perform integrated assessments and requires ad-hoc mixtures of software packages developed by different vendors, making the evaluation costly and in many cases incomplete. Moreover, current tools do not reliably integrate electric power systems modeling with other sectors that in recent years have become central to advancing renewable energy integration, such as natural gas and water networks.
The final objective is the development of a platform to model and analyze the interaction of renewable energy sources with other sectors across multiple timescales. The platform will be developed as an open-source tool available to operators and analysts, reducing the costs and technical challenges of performing a thorough assessment of renewable energy integration. The project will be coded in the Julia programming language with a focus on high-performance computing applications. Job responsibilities include developing the code to build the models and producing the accompanying documentation for future development.
Energy Resources Group - Energy Systems Modeling of Mexico (SWITCH- Mexico)
Evaluating Air Quality from Big Data - This is a big data problem around the evaluation of air quality data from transportation, and one that will require a strong time commitment. Your tasks will involve: (a) literature review, (b) database scripting, (c) statistical analysis, (d) visualization, and (e) proposing future work. Qualifications: We are looking for highly motivated and passionate students proficient in Python, SQL, GIS, and/or PostGIS who have the availability to immerse themselves in this project.
Energy Resources Group - Quantifying the Role of Partisanship in Rooftop Solar PV Adoption in the U.S.
Support for or opposition to renewable energy policies often takes on a partisan character in U.S. politics. However, solar PV and the state and federal policies that support it benefit consumers across the political spectrum. In turn, the benefits of rooftop solar PV policies create interest groups apart from ideological leanings. This has important implications for the politics of solar policy going forward. Hence, this project seeks to quantify the role of political voting behavior in the adoption of solar PV.
We are looking for a student to compile, clean, and merge geospatial (GIS) data: each state’s voting data, solar PV installation data, and U.S. Census Bureau income data.
The student will gain experience in data management with multiple geospatial datasets, merging and joining spatial data across scales, and using Python scripting to automate and document workflows. The student will have the potential for authorship if they show dedication to the project and contribute substantially to its different stages.
Qualifications: We are looking for a highly motivated student interested in solar PV or renewable energy politics. Strong organizational skills and attention to detail are required. Knowledge of Python and some GIS background are required. Experience with arcpy, ArcMap, and QGIS preferred.
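A hedged sketch of the tabular side of the merge task: joining county-level voting, PV-installation, and income tables on a shared FIPS code with pandas. All column names here are assumptions for illustration, and the real work would also involve spatial joins in GIS:

```python
import pandas as pd

def merge_county_data(votes, pv, income):
    """Join county-level voting, PV-installation, and income tables on a
    shared FIPS code. Column names (`fips`, `installs`, `population`) are
    hypothetical placeholders for the project's actual datasets."""
    merged = (votes.merge(pv, on="fips", how="left")
                   .merge(income, on="fips", how="left"))
    # Normalize adoption counts by population for cross-county comparison.
    merged["pv_per_1000"] = 1000 * merged["installs"] / merged["population"]
    return merged
```

Left joins keep every county with voting data, so counties missing PV or income records surface as NaNs to be cleaned rather than silently dropped.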
Energy Resources Group - Low-cost Monitoring to Improve Grid Reliability
Frequent electrical power failures drastically reduce quality of life and weaken the economic opportunities enabled through access to the electric grid. High-resolution measurements of the frequency and duration of power outages are critical to understanding and improving grid reliability. These measurements are also necessary to study deeper economic and socio-economic questions about how unreliable electricity impacts economic development and growth. But, due to the lack of appropriate monitoring technologies, it is not economically feasible for many utility companies in the developing world to gather this data. This project builds on a long partnership with the Zanzibar Electricity Corporation (ZECO), the sole electricity distribution utility in Unguja, Tanzania, to deploy the second generation of our novel, low-cost power grid monitoring system. This system monitors the location, duration, and scope of problems at the low-voltage, distribution level of the grid in near real time. Over the course of our proposed pilot, we aim to both measure and improve the reliability and accuracy of our system. Additionally, we plan to leverage our connections with ZECO to co-develop an interface that allows data from our pilot to directly help inform utility grid maintenance.
Jupyter Discovery Engine
The UC Berkeley Library, the Division of Data Sciences, and the Berkeley Institute for Data Science are looking for student developers to create JupyterHub extensions that facilitate the discoverability and reuse of datasets for research, course modules, and assignments, and that contribute to the reproducibility of newly published research.
Skills Required: Python, API experience, Apache, GitHub, SQL
UC Berkeley Business Intelligence Project
Berkeley's current applications are not well suited to building ad-hoc queries that answer simple high-level questions like "What is the current employee head count of the Economics Department?" The campus would love to have a BI tool where ad-hoc queries can be built in an easier, simpler way to answer some of these critical questions.
We want to build an interactive ad-hoc BI reporting solution where users can get high-level data with a simple search query against the data source (Oracle database, flat file). The main aim of this application is to simplify the way users build ad-hoc queries using keywords (similar to Twitter hashtags) such as Year, Month, Department, Employee Count, Average Years of Service, Enrollment, Revenue, and Budget. On the application's search/query page, we want to display some example keywords and canned queries. The system should be able to identify the keywords, and when there is not an exact match, it should recommend the best possible match.
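The keyword-recognition behavior described above, exact matching with a best-possible-match fallback, can be sketched with the standard library's difflib. The keyword list mirrors the examples in the text; the similarity cutoff is an assumption to tune:

```python
from difflib import get_close_matches

# Keyword vocabulary taken from the examples above.
KEYWORDS = ["Year", "Month", "Department", "Employee Count",
            "Average Years of Service", "Enrollment", "Revenue", "Budget"]

def match_keyword(token, keywords=KEYWORDS):
    """Return an exact keyword match (case-insensitive), or the closest
    known keyword when the user's token is only approximately right.
    The 0.6 cutoff is an illustrative default, not a tuned value."""
    lookup = {k.lower(): k for k in keywords}
    if token.lower() in lookup:
        return lookup[token.lower()]
    close = get_close_matches(token.lower(), lookup, n=1, cutoff=0.6)
    return lookup[close[0]] if close else None
```

Recognized keywords would then be mapped to columns or measures in the underlying data source to assemble the actual query.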
Liberating Textual Data: Opening up Public Archives for Open Research
Students on this Data Science Discovery Project will learn how to gather, parse, and publicly share digital archives that are currently inaccessible for research purposes. Using the Government Publishing Office’s Congressional Hearing Transcripts as an example, the project leads will guide participants through their own data liberation project, in which they will: scrape the web for document files while retaining document metadata; programmatically find and extract meaningful data objects within the documents; link those objects to external databases; prepare all this compiled textual data for computational analysis in R and Python; and host their newly formed database so that the public and other researchers can launch their own studies of the data. Participants will work together on group assignments built around a textual source of their own choosing, each culminating in a new, final database, a research paper, and its presentation. 3 units. Prerequisites: beginner familiarity with Python or R.
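As an illustration of extracting meaningful data objects while retaining provenance metadata, the sketch below pulls a title and date out of a transcript-like text with regular expressions. The field patterns are invented for illustration and do not reflect the GPO's actual document format:

```python
import re

def extract_record(doc_text, source_url):
    """Build a metadata-bearing record from a hearing-transcript-like text.
    The `TITLE:` and `DATE:` patterns are hypothetical examples, not the
    GPO's real markup; the source URL is kept as provenance metadata."""
    title = re.search(r"^TITLE:\s*(.+)$", doc_text, re.M)
    date = re.search(r"^DATE:\s*(\d{4}-\d{2}-\d{2})$", doc_text, re.M)
    return {
        "source_url": source_url,             # provenance for reproducibility
        "title": title.group(1).strip() if title else None,
        "date": date.group(1) if date else None,
        "body": doc_text,                     # full text kept for analysis
    }
```

Records like this can be accumulated into a database and linked to external sources (e.g., member or committee identifiers) in a later pass.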
PublicEditor: The Citizen Science Solution to Media Misinformation
Students on this project will help develop and test a collaborative web app guiding thousands of internet volunteers to read through the most shared news articles and find evidence of misinformation in the content. Working with a national coalition of social science researchers and journalists, a Nobel Laureate, cognitive scientists, and software designers/developers, students will see firsthand how the social-good technologies of the future get built. And students’ individual efforts will be essential to the creation of a world where the public has confidence in its capacity to discern truth from fiction and fact from opinion. 2-3 units. Prerequisites: none.
DecidingForce: Discovering Patterns of Peace and Violence between Police and Protesters
Students on this project will use a collaborative web app to process the information in over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. Data processed by students will be used to find patterns of peace and violence, which can be used to scaffold broad public conversations and shift the behavior of police and protest strategists. Students’ work will be used to create artificial intelligence able to understand dynamics between cities and movements, and to recommend policies more likely to result in peaceful and effective political expression. 2-3 units. Prerequisites: none.