Fall 2018 Discovery Projects

Computational Analysis of the Social Sciences

This project studies the trends in theories of organizations in the social sciences and professional fields. Students clean the data collected from ~400,000 JSTOR articles, trace engagement of these theories over time, and work on visualizing their findings.

Analyze Air Quality Data

Analyze QA/QC'd data from the San Francisco Bay Area to determine patterns and direct modifications within the network.

Data Enabled Donations

Two tons of mostly-expired dog food. An ice rink full of unsolicited goods. Although well-intentioned, the sheer scale of physical donations during and after a disaster consistently overwhelms traditional aid organizations, local governments, and community-based organizations. How might we transition from an ‘ad hoc’ disaster donations framework to a practical data-enabled approach that empowers survivors to seek and receive support, matches demand and supply in real time (minimizing mismatch), and/or encourages more effective donation strategies and processes? This project will build upon existing collaborations through the West Big Data Innovation Hub, City of Santa Rosa, and The Salvation Army. Research could include historical social media analysis, text analysis, and studies of online financial transactions, to better understand and mitigate issues for future disasters.

The Economics of Disaster Recovery

Hurricane Katrina. Superstorm Sandy. The Joplin Tornado. The 2017 Wildfire Season cost the State of California over $180 Billion USD, and the 2018 season is set to exceed this number. How do investments after a disaster impact the economic conditions of a community? Is an “ounce of prevention” before a disaster worth “a pound of cure”? This project builds upon existing collaborations through the West Big Data Innovation Hub and leverages State, Federal, and industry data in an attempt to answer these questions.

An Unacceptable Reality: Drinking Water in California

Up to 1 million Californians are exposed to unsafe tap water at some point during the year. Some communities have been exposed to unsafe water for more than a decade. Droughts, floods, aging infrastructure, and other human and natural causes can disrupt the water supply and limit—or eliminate—access to safe drinking water for days, months, or even years. While the most publicized examples are in rural areas, delivering safe, sufficient, and affordable drinking water poses a challenge in almost every region of the state. How might we leverage data to visualize and understand the challenge of supplying safe drinking water? How might we anticipate vulnerable areas where water disruptions may occur next? This project builds upon existing collaborations through the West Big Data Innovation Hub, with the California Governor’s Office of Planning and Research, California Government Operations Agency, and other organizers of the 2018 California Safe Drinking Water Data Challenge in an attempt to answer these questions.

Data Cleaning Projects on Social Protection in Latin America

This project seeks to document, clean, and prepare summary statistics using restricted-use data on domestic violence in Brazil and child development in Colombia.

Survey of Air Data Platforms

As a result of lower cost air sensors and increased attention to local-scale air monitoring, more and more air quality data are being collected by a growing variety of institutions and civic organizations. Efforts to collect and manage these data are underway but fragmented. In this project, students survey the various platforms that collect and curate air monitoring data (e.g. ESDR, OpenAQ, and Air Quality Data Commons), comparing their scope and capabilities, in order to identify best practices, redundancies, and gaps. The survey will be used to inform subsequent efforts to build data infrastructure for air monitoring.

"Toxic Soup" Analysis

The Richmond Community Air Monitoring Program (RCAMP) produces extensive data about levels of toxic air pollutants and fine particulates in the neighborhoods around the Chevron refinery, but data analysis has so far been limited. New metrics for data analysis have recently been developed by researchers from the Fair Tech Collective at Drexel University. By focusing on the number of detections, these metrics tell a story about simultaneous exposures to multiple pollutants—what residents have long referred to as breathing “toxic soup.” In this project, students (a) automate the process of calculating the new “toxic soup” metrics so that monthly reports could be released on a regular basis, (b) extend the analysis back to the beginning of the RCAMP program in 2013, and (c) analyze seasonal variation in the results, especially with respect to wind speed and direction.
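As a rough illustration of what automating a detection-count report might look like, the sketch below counts, per month, the timestamps at which two or more pollutants were detected at once. The actual "toxic soup" metric definitions are not given here, so the function name, the record layout, the pollutant names, and the detection limits are all hypothetical assumptions.

```python
from collections import defaultdict


def monthly_simultaneous_detections(readings, detection_limits, min_pollutants=2):
    """Count, per month, the timestamps with several pollutants detected at once.

    readings: iterable of (iso_timestamp, pollutant, value) tuples (assumed layout).
    detection_limits: dict mapping pollutant -> detection threshold (assumed values).
    """
    # Collect the set of detected pollutants for each timestamp.
    detected = defaultdict(set)
    for timestamp, pollutant, value in readings:
        limit = detection_limits.get(pollutant)
        if limit is not None and value >= limit:
            detected[timestamp].add(pollutant)

    # Tally timestamps with enough simultaneous detections, keyed by "YYYY-MM".
    monthly = defaultdict(int)
    for timestamp, pollutants in detected.items():
        if len(pollutants) >= min_pollutants:
            monthly[timestamp[:7]] += 1
    return dict(monthly)


# Illustrative, made-up readings and limits (not RCAMP data).
example_readings = [
    ("2013-05-01T10:00", "benzene", 1.2),
    ("2013-05-01T10:00", "toluene", 0.9),
    ("2013-05-01T10:05", "benzene", 1.5),
    ("2013-06-02T08:00", "benzene", 2.0),
    ("2013-06-02T08:00", "xylene", 0.1),
]
example_limits = {"benzene": 1.0, "toluene": 0.5, "xylene": 0.5}
print(monthly_simultaneous_detections(example_readings, example_limits))
```

Grouping by month keeps the output aligned with a monthly reporting cycle; seasonal analysis could reuse the same per-timestamp sets joined with wind data.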

Understanding Air Pollution and Health in Context of Industrial Operations

Ambient air quality data is most useful to communities and policy makers when fluctuating levels of air pollution can be linked to specific industrial activities (whether particular sources like coal trains, or episodic events like flaring or turnarounds at a refinery) and/or when they can be linked to health outcomes at the community level. In this project, students devise methods to link air pollution to emission sources, industrial activities, and health, using air quality and meteorological data from the Richmond Community Air Monitoring Program (RCAMP) and other relevant, publicly available sources of information that students will find. Students will brainstorm with knowledgeable air quality activists from around the Bay Area about methods for obtaining information and approaching the available data.

Exploratory Analysis of Natural History Databases

Natural history data is broad in scope, ranging from species occurrence records to environmental measurements to any other data that describes how organisms interact with each other and their environment. In the past 15 years there has been an increasing drive to digitize this data and make it available through public databases. This project explores these vast datasets from a data science perspective.

Network Analysis of Ancient Texts

This project begins with a text corpus of 65,000 ancient Sumerian texts in an online database. It employs tools for computational text analysis in order to build a historical network that shows the geographical and social trade networks from these texts, which date back to 2000 BC. Here is a brief description of some of this project’s past results: https://iaassyriology.com/computational-cuneiform/

Charter Schools and the Business Age: Analyzing and Visualizing Web Text and School Data

How does the push to run schools like businesses--complete with performance targets, incentives, and centralization in culture and governance--shape the growing charter school sector? Which charters survive and thrive in this political climate: those that stress standards-based rigor and college-readiness (traditional model), or those that prioritize independent thinking and socio-emotional development (progressive model)? And how does this differentiation affect charter school segregation--that is, do progressive schools serve white students in affluent, liberal communities while traditional schools serve students of color in poor or conservative communities? To start, this project extracted web text from the websites of every U.S. charter school open today, and has begun parsing for quantitative data. Students on this project perform exploratory data analysis to glean any insights that could prove useful for answering these questions.

Establishing Biological Age Baseline for Kidney Tissue Health using Machine Learning

The kidney is the central organ in the human body responsible for water and mineral balance. Multiple health disorders, such as diabetes, heart disease, and hypertension, as well as aging-related processes, affect kidney physiology. The aim of this project is to use computer vision and deep learning tools to predict age based on UCSF kidney histological images, accounting for additional demographic data such as gender and health history. We envision two main outcomes of this project: (1) the established age prediction will serve as a biological age baseline to assess the general health of the organ and (2) the discovery of novel or hard-to-manually-score histological features.

COSI Machine Learning

COSI, the Compton Spectrometer and Imager, is a NASA-funded balloon-borne gamma-ray telescope, observing Galactic nucleosynthesis, Galactic positron annihilation as well as the most violent events in our Universe (supernovae, neutron star mergers) and the most extreme environments (pulsars, black holes). This project is centered on improving the data analysis pipeline of COSI by applying the latest machine learning tools to individual pipeline tasks.

Motivations to Attend Coding Bootcamps

In the past five years, there has been a surge of coding bootcamps. Their purpose is to prepare graduates for jobs as programmers in the tech industry. Surprisingly, the percentage of women graduating from coding bootcamps is much higher than the percentage among those graduating with bachelor’s degrees in computer science or working in the tech industry. Why do women choose to enter computing now, often after having received a degree in a non-computing field? This research will use online blogs to investigate the motivating factors and reasons behind why people choose to attend coding bootcamps.

A Sumerian Network

Administrative documents from the city of Puzriš-Dagan (modern Drehem, southern Iraq), written in Sumerian cuneiform on clay tablets more than 4,000 years old, reveal a dense network of officials, members of the royal family, foreigners, and gods. Who are the central people in this network, what cliques can we identify, and who are the bridges between those cliques? This research will attempt to answer these questions.
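One way to picture the questions about central people and bridges is a small co-occurrence graph. The sketch below assumes each document has already been reduced to a list of the names it mentions; the function names and the placeholder labels are hypothetical, and the project itself may use different tools and measures.

```python
from collections import defaultdict
from itertools import combinations


def build_network(documents):
    """Link every pair of names that appear together in the same document."""
    graph = defaultdict(set)
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph  # names that never co-occur with anyone are left out


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def component_count(graph, skip=None):
    """Count connected components, optionally pretending one node is removed."""
    seen = set()
    count = 0
    for start in graph:
        if start == skip or start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor != skip and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


def bridging_nodes(graph):
    """Nodes whose removal splits the network into more components."""
    base = component_count(graph)
    return sorted(node for node in graph if component_count(graph, skip=node) > base)


# Placeholder labels, not names taken from the tablets.
example_documents = [
    ["person_a", "person_b", "person_c"],
    ["person_c", "person_d"],
    ["person_d", "person_e"],
]
network = build_network(example_documents)
print(degree_centrality(network))
print(bridging_nodes(network))
```

Removing each node in turn is quadratic in the size of the network, which is fine for a sketch but would likely need a dedicated graph library at the scale of a full archive.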

Radio Signal Identification and Classification

Breakthrough Listen is the world's leading search for extraterrestrial intelligence (SETI), using some of the world's most powerful telescopes in a $100 million survey of planetary systems around nearby stars for indicators of technology. Its rich dataset runs to many petabytes and includes a wide range of signals and modulation types generated by human technology. Students develop code to extract and classify features in large arrays of image-like data, looking for outliers, and apply a variety of approaches, including t-SNE, GANs, CNNs, and LSTMs, to help Breakthrough Listen understand the wide variety of radio frequency interference while also aiding in its quest to find signs of intelligent life beyond Earth.

Public Editor

Public Editor has been working on developing an online editing system. This system would allow the public to rate news sources for their credibility to discourage the circulation of fake news. Public Editor hopes to eliminate false information being provided to the public through this project.

Liberating Archives

This project opens up public archives for open research: students learn how to gather, parse, and publicly share digital archives that are currently inaccessible for research purposes. Using the Government Publishing Office’s Congressional Hearing Transcripts as an example, this project guides participants through their own data liberation project, in which they: scrape the web for document files while retaining document metadata; programmatically find and extract meaningful data objects within the documents; link those objects to external databases; prepare all this compiled textual data for computational analysis in R and Python; and host their newly formed database so that the public and other researchers can launch their own studies of the data.

Deciding Force

This project is about discovering patterns of peace and violence between police and protesters. Students will use a collaborative web app to process the information in over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. Data processed by students will be used to find patterns of peace and violence, which can be used to scaffold broad public conversations and shift the behavior of police and protest strategists. Students' work will be used to create artificial intelligence able to understand dynamics between cities and movements and recommend policies more likely to result in peaceful and effective political expression.

Bloodlines: Children’s stories and the transmission of racism in 20th century Europe

This project aims to explore the local-level transmission of racism through the study of children’s stories. In 20th-century Europe, children’s stories often aimed to discipline children’s behavior through the inducement of fear. These so-called “Kinderschreck” stories often featured rather innocent depictions of fantasy figures or animals that acted as bogeymen. In some European villages, however, bogeymen took more racist forms such as that of the “Forest Jew” or “Big black man”. Exploiting fine-grained village-level data gathered by folklorists throughout the Netherlands, Belgium, Germany, Austria, and Switzerland, this research tries to explain why such stories were more prevalent in some regions than others and what consequences this had for the widespread decay of pluralism in interwar Europe. RAs will be asked to help with the digitization and geo-coding of themes in children’s stories.

Investigating Slow Earthquakes in the Pacific Northwest

Under the guidance of Dr. Noel Bartlow, this project aims to analyze and visualize slow earthquakes in the Pacific Northwest region of the U.S. These slow earthquakes occur on the large Cascadia megathrust fault, and release energy equivalent to a magnitude 6.8 earthquake over a few weeks. The Cascadia megathrust fault runs under the Washington and Oregon coastlines, along with parts of northern California and British Columbia, and is capable of producing a magnitude 9 earthquake, similar to the devastating 2011 Japanese earthquake. Students in this project analyze and visualize slow earthquake data in order to extract features of interest, particularly systematic and statistically significant spatial or temporal changes.

Visualization platform for earthquake catalog completeness

This project aims to build an application to visualize earthquake catalog completeness for various sources (e.g. USGS, the Japan catalog) in order to provide an open source tool that communities can use. The completeness of a catalog is the minimum magnitude above which all earthquakes within a certain region are reliably recorded.
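To make the definition concrete, one common simple estimator, the maximum curvature method, takes the most populated magnitude bin as the completeness magnitude. The sketch below is a minimal version of that idea; the function name, the 0.1 bin width, and the sample magnitudes are assumptions for illustration, and the project may rely on a different estimator.

```python
from collections import Counter


def completeness_magnitude(magnitudes, bin_width=0.1):
    """Estimate completeness as the magnitude bin holding the most events.

    This is a maximum-curvature style estimate; it tends to be a lower bound,
    so practitioners often add a small correction on top of it.
    """
    if not magnitudes:
        raise ValueError("need at least one magnitude")
    # Use integer bin indices so floating-point noise does not split bins.
    counts = Counter(round(m / bin_width) for m in magnitudes)
    # Most populated bin; ties are broken toward the smaller magnitude.
    peak_bin = min(counts, key=lambda b: (-counts[b], b))
    return round(peak_bin * bin_width, 10)


# Made-up magnitudes, not taken from any real catalog.
sample = [1.0] * 5 + [1.5] * 20 + [2.0] * 10 + [2.5] * 5 + [3.0] * 2
print(completeness_magnitude(sample))
```

A visualization tool could run such an estimate per region and per time window to show how completeness varies across catalogs.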

Editing Madame Bovary: How Flaubert Composes his Novel

This project aims to track changes across the various drafts of Madame Bovary in order to document Gustave Flaubert’s writing practices and the influence of 18th- and 19th-century authors on his writing.

Causal Analysis to solve Keynesian uncertainty cases, and subsequent representation using MATLAB

The causal framework and its representation as a computer program are important for certain Data Security cases involving utter (Keynesian) uncertainty. We have dealt successfully with those cases, as well as the cases with uncertainty involving some prior probabilistic knowledge.

Design of Optical Character Recognition System for Ambulances

A green corridor is a system in which officers clear the roads so that an ambulance can pass easily. This can be difficult to implement in places with heavy traffic. In a situation where every second counts, this project aims to improve medical transportation through enhanced interfacing with traffic controls. Using optical ML classification algorithms, this project aims to help officers coordinate green corridors in advance, with traffic cameras serving as a tool to recognize when an ambulance is on the road.

Seismological Data-Driven Approach for Northern California Tectonic Stress Measurements

This project seeks to evaluate the spatial and temporal evolution of tectonic crustal stress fields to further understand the underlying mechanisms of earthquake nucleation and triggering processes. The focal mechanisms of earthquakes provide crucial input on the in-situ stress state in the seismogenic crust. The Northern California Earthquake Catalog (http://ncedc.org/) archives over 1.6 million earthquakes in northern (and central) California (from 1966 to present). In this project, students compute spatially and temporally varying stress fields for all of northern California. Outcomes from this project will contribute to a new northern California stress map, which will improve the seismic hazard model for northern California.