Platforms/Infrastructure

Wordnik - Hyphenation Project

Wordnik provides a hyphenation API, with data licensed from traditional dictionaries. However, more than half of the unique words of English aren't in any dictionary (see http://www.sciencemag.org/content/331/6014/176). To provide hyphenation for unknown words, we'd like to implement the Liang algorithm (https://www.tug.org/docs/liang/) and combine it with a model (based on known dictionary hyphenations) that provides a confidence metric for the...
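As a rough illustration of the pattern-matching core of Liang's approach, the sketch below scores every gap between letters using a small pattern table and breaks the word where the highest score is odd. The nine patterns are only the handful needed for the word "hyphenation"; a real pattern set has thousands of entries. The function names and the `left_min`/`right_min` defaults are assumptions made for this example, and the confidence model mentioned above is not covered.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as 'hen5at' into its letters and per-gap scores."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score for the gap before letters[pos]
        else:
            pos += 1
    return letters, scores

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at gaps whose maximum score is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."       # dots mark the word boundaries
    levels = [0] * (len(padded) + 1)        # levels[k] = gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            scores = table.get(padded[start:end])
            if scores:
                for offset, score in enumerate(scores):
                    levels[start + offset] = max(levels[start + offset], score)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # The gap before word[i] is levels[i + 1] because of the leading dot.
        if levels[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces

# Illustrative subset of patterns, enough for one word only.
SAMPLE_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na",
                   "n2at", "1tio", "2io", "o2n"]

print("-".join(hyphenate("hyphenation", SAMPLE_PATTERNS)))
```

Odd scores allow a break and even scores forbid one, so a longer, more specific pattern can override a shorter one simply by carrying a higher number.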

Berkeley Police Department - Simulating Alternative Responses to Calls for Service

The Berkeley City Council recently voted to audit the "calls for service" (CFS) received by the Berkeley Police Department (BPD) to determine the feasibility of transferring the response to certain types of calls to alternative emergency response agencies. Using historical CFS data provided by BPD, this project seeks to simulate the effects on response time, staffing levels, and racial disparities of alternative emergency response strategies.

The project will advance in complexity throughout the semester and perhaps through multiple semesters. We'll start with a cursory analysis of...

Lawrence Berkeley National Lab - Modeling Access Pattern of Large Scientific Data Repository with Machine Learning

Large amounts of data are archived on the tape archive (HPSS), especially for experiments that produce large volumes of data every year. The dCache system is a storage management system with HPSS connectivity; it transfers data from HPSS to designated storage disks when users request it. The issue is that every tape mount on HPSS adds overhead to retrieving a file, on top of the time spent seeking to the requested file on the tape and the time spent reading it. From the system's point of view, reading multiple files from the same tape can increase the lifetime...
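To make the mount overhead concrete, here is a minimal sketch that groups pending file requests by the tape holding each file, so that each tape is mounted once per batch instead of once per file. The file paths, tape identifiers, and the `tape_of` lookup are hypothetical; nothing here reflects the actual HPSS or dCache interfaces.

```python
from collections import defaultdict

def batch_by_tape(requests, tape_of):
    """Group requested file paths by the tape that stores them."""
    batches = defaultdict(list)
    for path in requests:
        batches[tape_of[path]].append(path)
    return dict(batches)

# Hypothetical catalog mapping each file to a tape ID.
tape_of = {
    "/exp/run1.dat": "T001",
    "/exp/run2.dat": "T002",
    "/exp/run3.dat": "T001",
    "/exp/run4.dat": "T001",
}
requests = ["/exp/run1.dat", "/exp/run2.dat", "/exp/run3.dat", "/exp/run4.dat"]

batches = batch_by_tape(requests, tape_of)
print(f"mounts if served one file at a time: {len(requests)}")
print(f"mounts if served per tape: {len(batches)}")
```

A model of access patterns could, under these assumptions, be used to predict which files on a mounted tape are likely to be requested soon and stage them in the same batch.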

Creative Commons - Linked Commons Graph Analysis

Creative Commons has collected graph data linking different web properties that use Creative Commons licenses. Nodes are domains (rather than, e.g., individual pages). For each node, we record the number of CC Licenses and their types found at that domain. The edges (as one might imagine) are links between two domains, and are weighted by how many links there are between the two domains. We currently have a graph with about 250,000 nodes.
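One way to picture the data described above is the following sketch: each domain carries a count of licenses by type, and each edge carries a link count. The class name, the example domains, and the choice to treat links as directed are assumptions made for illustration, not the project's actual schema. The `in_strength` method is only a simple baseline (total weight of incoming links), not a proposed answer to the questions the project poses.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class DomainGraph:
    # domain -> {license type: number of licenses found at that domain}
    licenses: dict = field(default_factory=dict)
    # source domain -> {target domain: number of links from source to target}
    links: dict = field(default_factory=lambda: defaultdict(dict))

    def add_domain(self, domain, license_counts):
        self.licenses[domain] = dict(license_counts)

    def add_links(self, source, target, count=1):
        self.links[source][target] = self.links[source].get(target, 0) + count

    def in_strength(self):
        """Total weight of incoming links for every domain that is linked to."""
        totals = defaultdict(int)
        for targets in self.links.values():
            for target, weight in targets.items():
                totals[target] += weight
        return dict(totals)

# Hypothetical example data.
graph = DomainGraph()
graph.add_domain("a.example", {"by": 10, "by-sa": 2})
graph.add_domain("b.example", {"by-nc": 5})
graph.add_domain("c.example", {"cc0": 1})
graph.add_links("a.example", "b.example", 3)
graph.add_links("c.example", "b.example", 4)
graph.add_links("b.example", "a.example", 1)

print(graph.in_strength())
```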

- How can we quantify the ‘influence’ of a node within the ‘Commons’, i.e., the set of domains hosting CC Licensed content? It would be...

Goodly Labs - Public Editor

Students on this project will analyze data from a collaborative web app guiding thousands of internet volunteers to read through the most shared news articles of the day and label evidence of misinformation in the content. Working alongside a national coalition of social science researchers and journalists, a Nobel Laureate, cognitive scientists, and software designers/developers, students will test the robustness of the Public Editor system, fortify it against attacks by trolls, and implement gamification features to ensure volunteers enjoy their experience.

Creative Commons - Image Popularity and Authority

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons which currently indexes metadata for around 500 million images. As with any search engine, one challenge is ranking the results appropriately.

- The ‘authority’ of an image is defined (for us) by the probability that someone would end up looking at a particular image, given that they’re puttering around the internet looking for images in general. For general webpages, the typical example of such a metric is...

Creative Commons - Image Clustering and Metric Inference

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons which currently indexes metadata for around 500 million images. As with any search engine, one challenge is ranking the results appropriately.

We have many images but relatively few clicks on them, and some of our images have little associated metadata. We would like to address this problem with the following steps:

1. Cluster the images based on their perceptual qualities. An...

School of Information - Evaluating Accessibility on Congressional Websites

Voters and constituents increasingly look to the websites of elected officials to answer their questions about voting or constituent services. Recent work has discovered that many of the leading presidential candidates’ websites are not accessible to vision-impaired users. This project aims to discover the breadth and depth of accessibility issues on congressional representatives’ websites. Discovery students will be responsible for developing a data collection framework using existing accessibility APIs and then analyzing the accessibility of congressional and election websites using...

NHERI SimCenter - Enhancing regional scale natural hazard simulations with artificial intelligence

The NHERI Computational Modeling and Simulation Center (SimCenter) provides next-generation computational modeling and simulation software tools, user support, and educational materials to the natural hazards engineering research community with the goal of advancing the nation’s capability to simulate the impact of natural hazards on structures, lifelines, and communities. In addition, the Center will enable leaders to make more informed decisions about the need for and effectiveness of potential mitigation strategies.

The SimCenter is currently undertaking activities that focus on...

Goodly Labs - Demo Watch

The Demo Watch project has collected and is curating over 8,000 news articles describing all the interactions between police and protesters during the Occupy movement. This semester, students will work with senior researchers and professors from Goodly Labs, NYU, and the University of Michigan to (1) implement a multi-level time-series model that will analyze curated Demo Watch data to find patterns of peaceful and violent activity; and (2) create a text classifier, via supervised machine learning, that is capable of scanning through news articles about protests to identify important data...