Parsing the news: Using NLP to analyze news articles

The Discovery team will develop classifiers to analyze news coverage, building on the work done in the Spring 2022 semester. This project will involve investigating and experimenting with different NLP approaches to extract information from news articles and then classify the articles using rule-based approaches, machine learning, or both. Possible approaches include pattern matching with regular expressions, part-of-speech (POS) tagging, co-reference resolution, and working with knowledge graphs (e.g. Wikipedia data).

This semester, we will be developing and refining models to classify the topics present in news articles, and the types of people who are quoted (e.g. elected officials, advocacy groups, religious leaders). We have labelled datasets of news coverage about voting rights, money in politics, nuclear weapons issues, and racial and religious profiling of Muslim, Arab, Sikh and South Asian communities going back around 10 years (about 220k articles total).
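To make the rule-based option concrete, here is a minimal sketch of how pattern matching with regular expressions might be used to pull out quoted speakers and assign them to one of the speaker types named above. It is an illustration only, not the team's existing method: the function names, the keyword lists, the quote pattern, and the sample text are all assumptions made up for this example, and real news copy would need far more robust handling (curly quotes, attributions placed before the quote, pronouns that need co-reference resolution, and so on).

```python
import re

# Hypothetical pattern: a straight-quoted passage followed by "said"/"says"
# and an attribution that runs up to the next period or semicolon.
QUOTE_RE = re.compile(r'"([^"]+)"\s*,?\s*(?:said|says)\s+([^.;]+)')

# Hypothetical keyword rules, checked in order; the first match wins.
# The categories come from the project description; the keywords are assumed.
SPEAKER_RULES = [
    ("elected official",
     re.compile(r"\b(?:senator|representative|governor|mayor)\b", re.IGNORECASE)),
    ("advocacy group",
     re.compile(r"\b(?:advocacy|coalition|campaign)\b", re.IGNORECASE)),
    ("religious leader",
     re.compile(r"\b(?:imam|rabbi|pastor|reverend)\b", re.IGNORECASE)),
]


def classify_speaker(attribution: str) -> str:
    """Return the first category whose keyword pattern matches the attribution."""
    for category, pattern in SPEAKER_RULES:
        if pattern.search(attribution):
            return category
    return "unclassified"


def extract_quotes(text: str) -> list[tuple[str, str, str]]:
    """Return (quote, attribution, category) for each attributed quote found."""
    results = []
    for match in QUOTE_RE.finditer(text):
        quote = match.group(1).rstrip(",").strip()
        attribution = match.group(2).strip()
        results.append((quote, attribution, classify_speaker(attribution)))
    return results


# Invented sample text, used only to exercise the functions above.
SAMPLE = (
    'The bill passed Tuesday. "This law protects every voter," said Senator '
    'Jane Doe. "We will keep organizing," said Alex Roe, director of a voting '
    'rights advocacy group.'
)

if __name__ == "__main__":
    for quote, attribution, category in extract_quotes(SAMPLE):
        print(f"{category}: {attribution!r} said {quote!r}")
```

A rule list like this is easy to inspect and adjust by hand, which makes it a reasonable baseline to compare against a machine learning classifier trained on the labelled datasets.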

We currently annotate news coverage of the issues we work on by hand, and building accurate classifiers will help us to automate that process. This will allow us to provide more and better data and analysis to our partner organizations working on media and communications around progressive issues.

Term

Fall 2022

Topic

Social Sciences

Technical Area(s)

Natural language processing (NLP)
