Berkeley Data Science Students, Journalism Faculty Building Public Database of Police Misconduct

David Barstow

Arda Demirci

Hannah Perlstein

December 20, 2021

Three years ago, California state law SB1421 took effect, making police records related to officer use-of-force incidents resulting in serious injury or death, sexual assault and acts of dishonesty accessible under the California Public Records Act. The bill was sponsored by various advocacy groups, including the California News Publishers Association.

But achieving the transparency embodied in SB1421 is challenging, both in collecting the information and making sense of the data. At UC Berkeley, the Berkeley Institute for Data Science (BIDS), the Data Science Discovery program and the Graduate School of Journalism are working with the California Reporting Project to help develop and automate a statewide database of the records that can be easily accessed by news organizations and the general public. The database project began in 2018 and is funded with a $1 million grant from Sony. 

Lisa Pickoff-White, an award-winning, data-driven reporter for San Francisco public television station KQED, is guiding the database project while on a one-year fellowship with the Big Local News program at Stanford University. She is working with BIDS, Discovery program students and other organizations to request use-of-force and police misconduct records from about 700 law enforcement agencies across California, and to develop tools to extract, analyze and make accessible the data from more than 100,000 files.

According to David Barstow, who leads the Investigative Reporting Program in the Journalism School, there is no easy way for either journalists or the public to get the information.

“We are figuring out how to ingest all this information, make sense of it and get it into an accessible form,” Barstow said. “It’s a significantly difficult data collection problem.”

Not only are there hundreds of pages, but the files include typed reports, hand-written notes, audio recordings and body cam footage. Pickoff-White said she even received data on a CD-ROM, which was stapled to a paper report.

“There is no consistency and every department has its own record-keeping system,” Barstow said. “It’s like all the info has been put in a blender. It’s a mess.”

Extracting relevant data

To sort it out, the data from each report is manually entered into the developing database twice, each time by a different student. An editor then compares the two versions and resolves any discrepancies. The results from each law enforcement agency are assembled into a database and, when complete, made available to local journalists.
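The double-entry check described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: the field names (`case_number`, `incident_date`, `officer_name`), the sample values and the normalization rule are all assumptions made for the example.

```python
def normalize(value):
    """Collapse whitespace and ignore case so trivial typing differences are not flagged."""
    if value is None:
        return ""
    return " ".join(str(value).split()).casefold()


def find_discrepancies(entry_a, entry_b):
    """Return {field: (value_a, value_b)} for each field where the two entries disagree."""
    diffs = {}
    for field in sorted(set(entry_a) | set(entry_b)):
        raw_a, raw_b = entry_a.get(field), entry_b.get(field)
        if normalize(raw_a) != normalize(raw_b):
            diffs[field] = (raw_a, raw_b)
    return diffs


# Hypothetical entries for the same report, keyed in by two different students.
first_pass = {"case_number": "19-0001", "incident_date": "2019-03-04", "officer_name": "J. Doe"}
second_pass = {"case_number": "19-0001", "incident_date": "2019-03-14", "officer_name": "j.  doe"}

# Only the date is handed to the editor; the name differs in case and spacing alone.
for field, (a, b) in find_discrepancies(first_pass, second_pass).items():
    print(f"{field}: {a!r} vs {b!r}")
```

Normalizing before comparing keeps the editor's queue focused on substantive disagreements, though how aggressive that normalization should be is a judgment call for the people running the database.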

The information is also being used by Berkeley Data Science Discovery program students Arda Demirci and Hannah Perlstein, who worked as a team to train the system to reduce or replace the manual steps. Demirci, a junior majoring in computer science and data science, has been coding scripts to extract data such as case numbers, names of those involved and dates from submitted reports. Perlstein then converted that data into tables that journalists could search. The scripts extract data from both formatted PDFs and transcripts of audio recordings. Although a human reader would likely find it easier to glean information from the PDFs, the transcripts provide more context and are easier for scripts to parse, they said. Fellow student Selina Kim, who is not a member of the Discovery program, also contributed to the effort.
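As a rough illustration of what such an extraction script might look like, the Python sketch below pulls a case number and dates out of report text and writes them as a table row. The patterns, the column names and the sample text are assumptions made for this example; real reports vary widely by agency, and the students' actual scripts are not shown in this article. Dates are assumed to be written in U.S. month/day/year order, and name extraction is left out because it is much harder to do with simple patterns.

```python
import csv
import io
import re

# Assumed formats: "Case No. 19-0042", "Case Number: 19-0042" or "Case #19-0042".
CASE_RE = re.compile(r"Case\s*(?:No\.?|Number|#)\s*:?\s*(\d[A-Z0-9-]*)", re.IGNORECASE)
# Assumed U.S.-style dates such as 03/04/2019 or 3/5/2019.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def extract_fields(text):
    """Return the first case number and all dates (as ISO strings) found in the text."""
    case = CASE_RE.search(text)
    dates = [f"{y}-{int(m):02d}-{int(d):02d}" for m, d, y in DATE_RE.findall(text)]
    return {"case_number": case.group(1) if case else "", "dates": ";".join(dates)}


def to_csv(rows):
    """Serialize extracted rows as CSV text that a spreadsheet or database can load."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["case_number", "dates"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical snippet of report text.
sample = "Case No. 19-0042\nIncident occurred on 03/04/2019; report filed 3/5/2019."
print(to_csv([extract_fields(sample)]))
```

Patterns like these only work when a department's reports follow a predictable layout, which is one reason hand-written rules are hard to carry from one agency to the next.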

The next step for Demirci and Kim is to help build a synthesizer, a machine learning-guided tool that will decide which script should be used to best answer a journalist’s query. “For me, this is the more exciting stuff and I think we’ll do some serious work here,” Demirci said.

Demirci applied to work on the project both for its technical challenges and because it is socially meaningful. For Perlstein, it’s more personal. Her grandfather was a prosecutor for the Los Angeles County District Attorney’s Office, and her father still works there in that role. She has discussed the project with him.

“He finds it very interesting and thinks it’s phenomenal that I’m working on it,” Perlstein said. “In the end, his goal is to serve justice – in the pure sense – and see that there is accountability.”

Now a junior, Perlstein enrolled at Berkeley intending to major in political science. But after taking Data 8: The Foundations of Data Science, Berkeley’s introductory data science class, she switched her major to data science.

“I love numbers and realize you can do anything with data science, from helping society to working for corporations,” Perlstein said. “When I applied to this project I wanted to make a societal impact and we are. I’m also developing new skills and can definitely see my skills grow as I work on this project.”

Like Perlstein and Demirci, sophomore Pruthvi Innamuri was attracted to the effort for both its technical and societal implications. Guided by his advisor, Hellina Nigatu, Innamuri and fellow student Jason Cheng worked with 1,200 files from one police department to develop a natural language processing (NLP) model for automatically extracting specific data points from the reports.

The team started with an optical character recognition (OCR) model to convert the scanned case documents to .txt files. The cases ranged in length from half a page to 144 pages and covered a variety of topics. The students then wrote a Python script to tag relevant terms such as names, dates and locations, generating a training dataset. That training data was then used to train a named entity recognition (NER) model. The results were good, so they began the much longer process of creating a more extensive dataset to train the model to automatically tag the key terms.
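The tagging step described above can be approximated with a short Python sketch that marks dates and known names in OCR output and records them as character-offset spans, a common way to represent training data for entity recognition models. Everything specific here is an assumption for illustration: the date pattern, the `PERSON` and `DATE` labels, the list of known names and the sample sentence. It is not the students' script.

```python
import re

# Assumed U.S.-style date format; real reports would need more patterns.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def tag_text(text, known_names):
    """Return the text with (start, end, label) spans for dates and known names."""
    spans = [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    for name in known_names:
        # re.escape treats the name as literal text, not as a pattern.
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "PERSON"))
    spans.sort()
    return {"text": text, "entities": spans}


# Hypothetical OCR output and a hypothetical list of names taken from hand-entered records.
example = tag_text("On 03/04/2019 Officer Jane Roe responded.", ["Jane Roe"])
for start, end, label in example["entities"]:
    print(label, repr(example["text"][start:end]))
```

Rule-based tags like these are only a starting point. Their value is in quickly producing labeled examples that a statistical model can learn from and, ideally, generalize beyond.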

“We achieved good results, but our model only worked for that department’s reports and wasn’t generalizable to other formats,” said Innamuri, who is majoring in computer science and data science. “We would need far more training data to do that, but it shows you need machine learning to be able to handle changing formats of the reports.”

His team is wrapping up the project by identifying a cost-effective NLP service with named entity recognition capabilities that would also be easy for journalists to use.

Calling the project both super-ambitious and incredibly difficult, Barstow praised the Berkeley students, saying they are “making incredible contributions. We couldn’t be happier with their enthusiasm, expertise and brainpower.”

Cross-bay collaboration

Across the bay from Berkeley, Stanford Journalism Professor Cheryl Phillips is collaborating with Pickoff-White and Barstow on the California Reporting Project, which now includes 40 different news organizations across the state. She also teaches students how to access and analyze data as part of their reporting. During the spring semester, students from Barstow’s program in Berkeley virtually attended Phillips’ Big Local Journalism class, in which students use projects to tackle data-driven journalism.

As at Berkeley, students at Stanford have also been trained and are manually entering data from various police departments. In the spring, a prototype of the database was used to document use of force by Bakersfield police. A report co-authored by Pickoff-White found that between 2016 and 2019, Bakersfield police officers used force that broke at least 45 bones in 31 people. In all 31 cases, which included a bicyclist stopped for not having a headlight at night and a curfew violation in a city park, the Bakersfield Police Department determined that none of the officers involved had violated departmental policy.

To access the data, Pickoff-White wrote the code and some Berkeley and Stanford journalism students added to that analysis using Python and R. Nithin Chalapathi, a Ph.D. candidate in the Department of Electrical Engineering and Computer Science, is also working on a tool to help reporters identify trends in data or search without coding.

“Since code is reproducible I'm writing recipes for people to follow in the future,” Pickoff-White said. “For instance, Harriet Rowan, a data journalist at the Bay Area News Group, built on that work for a series coming out soon covering another jurisdiction. And I'm further refining the recipes as we analyze more places.”

Phillips, who earned two team Pulitzer Prizes as a newspaper reporter and editor, said that with shrinking staffs and fiscal constraints, news organizations – especially newspapers – don’t have the resources to dig through local data for stories. The project is now addressing that by creating databases on a city-by-city basis, with data from San Jose soon to be released and another database for San Diego under construction.

Meanwhile, a backlog of reports dating from 2014 still waits to be entered and more are still coming in, albeit in fits and starts as various agencies decide how to respond to the requests.

Although news organizations are the target audience of the databases for now, project leaders believe they will also be used by defense attorneys, community organizations and individuals. To make the information more easily accessible, a public-facing dashboard is being designed.

Arlet Miranda, a junior from San Jose who is also a member of the Discovery team, is analyzing data from a survey of Bakersfield residents to determine what kind of information the community would like to find via the dashboard. Badge number? Number of incidents an officer has been involved in?

To get a broad perspective, the survey is focusing on two segments of the population: those who have been stopped by Bakersfield police and those who know someone who works in law enforcement. The interviews are being conducted in person. Miranda said the work is slow going but fulfilling, and she plans to return next semester.

She got interested in computer science in high school when she took her school’s first course in the subject and used block coding to create animations. Although she originally planned to major in computer science at Berkeley, she is now majoring in data science with a domain emphasis in applied math and modeling.

Looking ahead

As journalists who have worked in other states, Barstow, Phillips and Pickoff-White hope that the database development work being done in California can also be adapted for use in states where police records are being made public, including New York, Illinois and Louisiana. Maryland lawmakers are currently battling over the issue as well, Phillips said. She also thinks it could have an even broader effect.

“If police departments know that anytime force is used the information may be released and scrutinized using these databases, we may see more policy changes and increased training in how officers should react in different situations,” Phillips said. “The work that the students are doing to drive this project is amazing and on the cutting edge of things, what can we do with machine learning that hasn’t been done before.”

Pickoff-White said that the resulting database and dashboard could also make it easier to identify officers who are fired for use-of-force incidents resulting in serious injury or death, sexual assault or acts of dishonesty so it’s not so easy for them to get hired by another department.

“If used in a careful and meaningful way, it could make a real difference and hold political leaders accountable,” Barstow said. “We’ll be able to ask questions about how law enforcement operates in California and get solid, reliable information. That’s what we’re excited about.”

The California Reporting Project shares some resources and staff with the Community Law Enforcement Accountability Network (CLEAN), a collaboration between UC Berkeley and the criminal defense lawyers association. In August, a group of Berkeley researchers won a 3-year, $2 million National Science Foundation grant to improve the usability of big criminal justice datasets for public defenders and others. The new Effective Programming, Interaction, and Computation with Data (EPIC) Lab is creating computing tools to help defenders, investigators and paralegals without coding expertise more easily research police misconduct, judicial decision-making and related issues for their cases.

Pruthvi Innamuri

Lisa Pickoff-White

Cheryl Phillips

Arlet Miranda
