The Demo Watch project has collected and is curating over 8,000 news articles describing the interactions between police and protesters during the Occupy movement. This semester, students will work with senior researchers and professors from Goodly Labs, NYU, and the University of Michigan to (1) implement a multi-level time-series model that will analyze curated Demo Watch data to find patterns of peaceful and violent activity; and (2) create a text classifier, via supervised machine learning, capable of scanning news articles about protests to identify important data for analysis.
Our team strives to better understand the factors that cause some of these encounters to turn violent while others remain peaceful. With this understanding, we could help avoid the use of force and ensure that people can safely exercise the right to demonstrate.
Data Collection and Organization
The newspaper articles we used as our data source describe Occupy events across the United States. The project is also in the process of collecting data from citizen-scientists who read and answer questions about the articles to help us understand what happened in each city: How many protesters were present? How many police officers? Where did the event take place? Were there arrests or injuries? Our goal was to use this data to model events and determine the factors that influenced their outcomes. We focused on the newspaper text and metadata to accomplish this, and our code is demonstrated here.
To organize our data into a more workable format, our team created classes to represent articles and text units of analysis (TUAs). Each article object contains the TUAs associated with it as well as other important information including the city, event date, and other metadata. The TUAs are text snippets from our collection of news articles.
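A minimal sketch of the classes described above might look like the following; the exact attribute names here are assumptions for illustration, not the project's actual field names:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class TUA:
    """A text snippet (text unit of analysis) drawn from an article."""
    category: str  # e.g. "Protester", "Camp", "Strategy"
    text: str

@dataclass
class Article:
    """One news article plus its metadata and associated TUAs."""
    city: str
    publish_date: date
    event_date: Optional[date] = None  # filled in by the date algorithm
    tuas: List[TUA] = field(default_factory=list)
```

Dataclasses keep the metadata and the article's TUAs bundled together, which makes the later grouping and analysis steps straightforward.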
One of the algorithms we implemented this semester calculates the date of the main event described in each article. Observing the date patterns in a sample of articles, we noticed that an article was either written a few days after the event, in which case it referred to the event by its day of the week, or written on the day the event actually occurred. We therefore devised an algorithm that looks for a day-of-week mention and subtracts its distance from the article's publication date to obtain the event date. If no weekday is found, the algorithm returns the article's publication date.
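The weekday-subtraction heuristic can be sketched as follows; the function name and the simple regular-expression matching are illustrative assumptions, not the project's exact implementation:

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def infer_event_date(text: str, publish_date: date) -> date:
    """Look for a weekday name in the article text; if one is found,
    step back to the most recent occurrence of that weekday on or
    before the publication date. Otherwise assume the article was
    written on the day of the event."""
    match = re.search(r"\b(" + "|".join(WEEKDAYS) + r")\b", text.lower())
    if match:
        target = WEEKDAYS.index(match.group(1))
        # Days to subtract so the result falls on the mentioned weekday.
        offset = (publish_date.weekday() - target) % 7
        return publish_date - timedelta(days=offset)
    return publish_date
```

For example, an article published on a Wednesday that mentions "Monday" would be dated two days earlier; an article with no weekday mention keeps its publication date.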
Since there are often multiple reports of the same event, another algorithm we developed attempts to canonicalize these data into unique event instances, essentially a pairwise deduplication task. To do this, our team used the Article class's event_date and city attributes to group the articles and determine whether a given article describes the same event as another.
Our project relies heavily on user-generated TUAs. In our investigation, we wrote several utility functions that parse the article text and the annotations file and load each article into a Pandas DataFrame. These functions allowed us to quantify characteristics of each TUA, including its length and its category.
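One such utility can be sketched as follows; the function name and the record shape (dicts with "category" and "text" keys) are assumptions for illustration:

```python
import pandas as pd

def tuas_to_frame(tuas):
    """Load raw TUA records into a DataFrame and derive a length
    column, so length and category can be summarized per TUA.
    `tuas` is an iterable of dicts with 'category' and 'text' keys."""
    df = pd.DataFrame(tuas)
    df["length"] = df["text"].str.len()
    return df
```

With the TUAs in a DataFrame, per-category summaries reduce to ordinary groupby and value-count operations.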
Analysis of the TUAs from each article led to several key results. Unsurprisingly, the words that appeared most frequently across all articles were "said," "Occupy," "city," "protesters," and "park."
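The frequency count itself can be sketched with a simple tokenizer and Counter; the stopword list here is a tiny illustrative placeholder, not the one the project actually used:

```python
import re
from collections import Counter

# Illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "on"})

def top_words(texts, n=5):
    """Tokenize article texts and return the n most frequent words,
    skipping stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in re.findall(r"[a-z']+", text.lower())
                      if w not in STOPWORDS)
    return counts.most_common(n)
```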
Because the scope of our investigation lies within the Occupy movement, a socio-political movement expressing opposition to social and economic inequality, it follows that these are the words users picked out most frequently. Moreover, the bar plot in the Jupyter Notebook shows the breakdown of TUA counts by category. The most common TUA category is "Useless," followed by "Strategy," "Camp," and "Protester." Surprisingly, one of the least frequent categories was "Police." One potential reason for this is the lack of police perspective within news articles and media.
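The category breakdown reduces to a value count over the TUA DataFrame; this sketch assumes a DataFrame with a "category" column as in the utilities above:

```python
import pandas as pd

def category_breakdown(df: pd.DataFrame) -> pd.Series:
    """Counts of TUAs per category, sorted in descending order.
    In a notebook, category_breakdown(df).plot(kind="bar")
    renders the bar chart."""
    return df["category"].value_counts()
```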
Limitations and Future Goals
One of the challenges we faced in this project was working with a different set of data than we had originally intended. Since the process of using citizen scientists to annotate and extract information from the articles was happening in real time this semester, we had to make do with what we had, namely the article text and metadata. Through this process, we learned that data sourcing is not as straightforward in the real world and had to adapt to the situation.
Ideally, in the future, the article text for this project will be fully annotated and we will be able to use that work to expand on our article classes further. Among the attributes we could add are number of arrests, number and severity of injuries, level of police activity, and timing of actions. We would then be able to combine this with our pairwise matching algorithm to find the characteristics of different events. The goal then would be to join this data with that of city characteristics (police training procedures, local government political leanings, etc.) to build models that allow for the prediction of peace and violence at these events.
Joe Cannice, William Furtado, Vyoma Raman, Siyi Wu