Industry/Economics

NLP for Cannabis Text Data

For this project, research apprentices will use Python to write code for scraping, munging, and classifying product data in order to better understand the dynamics of the United States cannabis industry. Apprentices will apply their programming skills to (1) scrape product data from publicly available websites; (2) turn messy, unstructured data sets into clean data sets suitable for reproducible research; and (3) apply the latest techniques in natural language processing to find trends and patterns in product description data. These data science techniques will help us uncover the political and...
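The munging-and-trends workflow above might be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the product records, field names, and values are all hypothetical, standing in for data that apprentices would scrape themselves.

```python
from collections import Counter
import re

# Hypothetical scraped product records; the fields and values are
# illustrative only, not from any real dispensary website.
raw_products = [
    {"name": "  Sour Diesel 3.5g ", "description": "Energizing sativa, citrus aroma"},
    {"name": "OG Kush 1g",          "description": "Classic indica; earthy, relaxing"},
    {"name": "sour diesel 7g",      "description": "Citrus-forward SATIVA, energizing"},
]

def clean(record):
    """Munge one messy record: trim whitespace, normalize case."""
    return {
        "name": record["name"].strip().lower(),
        "description": record["description"].strip().lower(),
    }

def tokenize(text):
    """Split a description into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

products = [clean(p) for p in raw_products]

# Count terms across all descriptions to surface recurring themes,
# a first step before heavier NLP (embeddings, topic models, etc.).
term_counts = Counter(tok for p in products for tok in tokenize(p["description"]))
print(term_counts.most_common(3))
```

A real pipeline would swap the toy term counts for proper NLP models, but the clean-then-analyze structure stays the same.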

Creative Commons - Linked Commons Graph Analysis

Creative Commons has collected graph data linking different web properties that use Creative Commons licenses. Nodes are domains (rather than, e.g., individual pages); for each node, we record the number of CC Licenses and their types found at that domain. The edges, as one might imagine, are links between two domains, weighted by how many links exist between them. We currently have a graph with about 250,000 nodes.

- How can we quantify the ‘influence’ of a node within the ‘Commons’, i.e., the set of domains hosting CC Licensed content? It would be...
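One standard starting point for node influence in a weighted link graph is PageRank (the random-surfer model). Below is a minimal, self-contained power-iteration sketch over a tiny hypothetical three-domain graph; the domain names and link counts are invented for illustration, and a real analysis of the 250,000-node graph would use an optimized library implementation.

```python
# Tiny hypothetical domain graph: edges[u] maps neighbor -> link count,
# mirroring the weighted domain-to-domain links described above.
edges = {
    "a.org": {"b.org": 3, "c.org": 1},
    "b.org": {"c.org": 2},
    "c.org": {"a.org": 1},
}

def pagerank(edges, damping=0.85, iters=100):
    """Weighted PageRank by power iteration (random-surfer model)."""
    nodes = list(edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u, nbrs in edges.items():
            total = sum(nbrs.values())  # total outgoing link weight of u
            for v, w in nbrs.items():
                # u passes rank to v in proportion to the link weight.
                new[v] += damping * rank[u] * w / total
        rank = new
    return rank

rank = pagerank(edges)
print(max(rank, key=rank.get))  # the most 'influential' domain
```

Ranks sum to one, so each node's score can be read as the long-run probability that a random surfer is at that domain.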

East Bay Community Energy - Evaluating Alameda County CO2 Emissions and Optimizing Customer Programs Using Marginal Emissions Data

East Bay Community Energy’s goal is to provide cleaner and cheaper energy to East Bay cities and customers. The goal of this project is to answer the following questions:

1. Compare CO2 accounting methods for EBCE's electricity load: a locational/market-based approach versus a marginal approach.
2. Optimize electric vehicle (EV) and demand response (DR) schedules to minimize emissions and energy costs.
3. Evaluate whether more accurate energy demand forecasts lower CO2 emissions.
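The EV scheduling question can be illustrated with a simple greedy heuristic: given hourly marginal emission rates, charge at full power during the cleanest hours first. The rates, energy requirement, and charger limit below are hypothetical; a real optimization would also account for prices, availability windows, and grid constraints.

```python
# Hypothetical hourly marginal emission rates (kg CO2 per kWh) for one day.
marginal_rates = [0.40, 0.38, 0.35, 0.30, 0.28, 0.30, 0.45, 0.55,
                  0.60, 0.50, 0.35, 0.25, 0.20, 0.22, 0.30, 0.45,
                  0.60, 0.70, 0.75, 0.65, 0.55, 0.50, 0.45, 0.42]

def schedule_charging(rates, kwh_needed, max_kw=7.0):
    """Greedy: fill the cleanest hours first, up to the charger limit."""
    hours = sorted(range(len(rates)), key=lambda h: rates[h])
    plan = [0.0] * len(rates)  # kWh delivered in each hour
    remaining = kwh_needed
    for h in hours:
        if remaining <= 0:
            break
        plan[h] = min(max_kw, remaining)
        remaining -= plan[h]
    emissions = sum(plan[h] * rates[h] for h in range(len(rates)))
    return plan, emissions

plan, emissions = schedule_charging(marginal_rates, kwh_needed=20.0)
```

Here the 20 kWh land in the midday hours (indices 11-13), when the hypothetical marginal rates bottom out, rather than during the evening peak.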


WITI@UC, CITRIS and COE - Visualizing Women in Tech Data at UC

As part of continuing research into the state of women in tech, WITI@UC would like to enlist students to visualize data from CalAnswers and UCOP to show percentages and changes over time in the participation of women in tech fields, and specifically the participation and persistence of women on the Berkeley campus in various STEM majors. The research will look at intersectionality and potential opportunities for effective interventions to retain diverse talent in STEM majors. Participants will help define the research questions and will be FERPA...

UC Berkeley School of Law - Indigenous Brands and Social Movements

Have you noticed how many brands in the marketplace use indigenous/Native American-oriented terms and imagery? From the notorious Washington team to the logo on Land O'Lakes butter, indigenous imagery and appropriation are everywhere, and worthy of serious study and data analysis. This project aims to bring the work of social movements and activism to a data-intensive study of First Nations brands and trademarks in order to chart a path forward. In this project, we will study how many of these terms are owned by native vs. non-native entities, study how these brands have changed over...

Creative Commons - Image Popularity and Authority

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons, which currently indexes metadata for around 500 million images. As with any search engine, one challenge is ranking the results appropriately.

- The ‘authority’ of an image is defined (for us) by the probability that someone would end up looking at a particular image, given that they’re puttering around the internet looking for images in general. For general webpages, the typical example of such a metric is...

Creative Commons - Image Clustering and Metric Inference

CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons, which currently indexes metadata for around 500 million images. As with any search engine, one challenge is ranking the results appropriately.

We have many images but relatively few clicks on them, and some of our images have little metadata associated with them. We would like to attempt to rectify this problem via the following steps:

1. Cluster the images based on their perceptual qualities. An...
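Step 1 might be sketched with plain k-means over perceptual feature vectors. Everything below is hypothetical: real features would come from an image embedding model, not the hand-picked 2-D points used here, and the fixed initial centers are chosen only to keep the sketch deterministic.

```python
# Hypothetical 2-D "perceptual" feature vectors for six images; a real
# pipeline would extract high-dimensional embeddings from the images.
features = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),   # e.g., darker images
            (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]   # e.g., brighter images

def kmeans(points, centers, iters=10):
    """Plain k-means with fixed initial centers (deterministic sketch)."""
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean).
        labels = []
        for pt in points:
            dists = [sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centers]
            labels.append(dists.index(min(dists)))
        # Move each center to the mean of its assigned points.
        for ci in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == ci]
            if members:
                centers[ci] = tuple(sum(vals) / len(members)
                                    for vals in zip(*members))
    return labels

labels = kmeans(features, centers=[(0.0, 0.0), (1.0, 1.0)])
```

Once images are clustered, metadata and click signals observed for some members of a cluster could be propagated as priors to the sparsely annotated members, which is the "metric inference" half of the project title.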

NHERI SimCenter - Enhancing Regional-Scale Natural Hazard Simulations with Artificial Intelligence

The NHERI Computational Modeling and Simulation Center (SimCenter) provides next-generation computational modeling and simulation software tools, user support, and educational materials to the natural hazards engineering research community with the goal of advancing the nation’s capability to simulate the impact of natural hazards on structures, lifelines, and communities. In addition, the Center will enable leaders to make more informed decisions about the need for and effectiveness of potential mitigation strategies.

The SimCenter is currently undertaking activities that focus on...

The Tempest Media - AI Content Management System for The Tempest Media

The Tempest is a digital publishing platform that has published hundreds of articles over the past 5 years. This project will focus on building a machine learning model that examines social media trends and suggests which pieces of content to repost. The aim is to find the most relevant content for trending topics, resurface that content, and determine the optimal times of day to post each piece.
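The "optimal times of day" piece can be illustrated with a simple aggregation: average past engagement by posting hour and pick the best hour. The engagement log below is entirely hypothetical; real numbers would come from social media analytics, and a production model would condition on topic, platform, and audience as well.

```python
from collections import defaultdict

# Hypothetical engagement log: (hour posted, engagement count) pairs.
engagement_log = [(9, 120), (9, 180), (13, 300), (13, 260), (20, 90)]

def best_posting_hour(log):
    """Average engagement per posting hour, then pick the top hour."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, engagement in log:
        totals[hour] += engagement
        counts[hour] += 1
    averages = {h: totals[h] / counts[h] for h in totals}
    return max(averages, key=averages.get), averages

hour, averages = best_posting_hour(engagement_log)
```

On this toy log, 13:00 posts average the most engagement, so that hour would be suggested for reposts.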

The Tempest is a global media publisher with an emphasis on personal narratives. The challenge will be not only repurposing content that is popular with trends in the USA but also...

Powerside - System Telemetry Analysis

Powerside wishes to develop condition-monitoring algorithms for high-value systems with generally high consequential costs of failure: industrial, medical, transportation, government, and telecom systems, for example. This requires setting up a modeling and operational model-management framework within the AWS environment, and developing unsupervised and supervised models to anticipate and isolate fault conditions across diverse data streams. Particular emphasis is on making a system that can be seamlessly introduced into our fleet operations.
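One of the simplest unsupervised starting points for fault anticipation is flagging telemetry readings that deviate sharply from a rolling baseline. The sketch below uses a rolling z-score on a hypothetical stream; the readings, window size, and threshold are invented, and a production system would use more robust statistics (or learned models) and handle baseline contamination after a fault.

```python
import statistics

# Hypothetical telemetry stream: steady readings with one fault spike.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1, 9.9, 10.0]

def flag_anomalies(stream, window=5, threshold=3.0):
    """Flag indices whose z-score vs. the preceding window exceeds threshold."""
    flags = []
    for i in range(window, len(stream)):
        prior = stream[i - window:i]
        mu = statistics.mean(prior)
        sigma = statistics.stdev(prior)
        if sigma > 0 and abs(stream[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

print(flag_anomalies(readings))
```

Note that the spike itself enters later windows and inflates their baseline variance, one reason real condition-monitoring systems prefer robust estimators or exclude flagged points from the baseline.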