The goal of the research project is to build a citation network from any collection of PDFs. The project data mines a large collection of sources from the disciplines of ancient Near Eastern Studies, Classics, Archaeology, and Middle Eastern Languages. The results of the project make this collection more internationally accessible for research by scholars in these fields by creating novel tools for computational textual analysis.
About the Researcher
Dr. Adam Anderson received his PhD from Harvard in the ancient languages and civilisations of the ancient Near East. His research focused on the biggest textual datasets from the Bronze Age, in order to build empirical models which inform us of the complex systems (socio-political-economical) and civilisations that rose to greatness over the centuries only to collapse into oblivion.
Bringing an intensive background in historical and computational linguistics, Dr. Anderson has been a postdoc at UC Berkeley since the summer of 2016. He was originally hired to develop the Digital Humanities (DH) program, which combines data science with the many humanities departments and language programs on campus. Since arriving, he has helped build the DH summer minor and certificate programs and the DH Fair, and has coordinated the DH Working Group (primarily for graduates, faculty, and staff) and BUDHA, an undergraduate student organization for DH. He also currently manages the Computational Social Sciences Training Program at BIDS, which combines social sciences and data science for advanced graduate students. He constantly works to find new ways to combine the quantitative tools and methods of data science with the more qualitative disciplines on campus.
What motivated you to initiate/propose this project?
I proposed the AWCA project in order to build an empirical model of the research landscape from a collection of books and articles in the field of Near Eastern Studies. Like many fields in academia, there are inherent divisions which make it difficult to integrate the research from, say, archaeologists with the philological work of Assyriologists. In the end, scholars from two different departments may in fact be studying the very same geographical region, but because of departmental divides, they might not learn of each other’s work or see the big picture. In order to bridge these gaps, one must incorporate a tremendous amount of primary and secondary sources into one’s research, especially when studying the ancient world. The idea behind AWCA is to build a workflow which will generate this research landscape and allow us to analyze and visualize the many relations between a diverse body of research for a given field (e.g. books, journals, white papers, preprints, etc.). As it turned out, the models we built are language agnostic and can be applied to any corpus of texts. Because of this, I have been able to test these tools and methods in the courses I’ve taught in Data Science (Data 88) and Digital Humanities (DigHum 100).
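As a rough illustration of what such a research landscape might look like in code, the sketch below builds a small directed citation graph and lists the works cited by more than one source. It is not the AWCA implementation: the function names, the titles, and the idea that citations arrive as (citing, cited) pairs are all assumptions made for the example.

```python
from collections import defaultdict


def build_citation_network(citations):
    """Build a directed graph from (citing, cited) pairs.

    Returns a dict mapping each work to the set of works it cites.
    """
    graph = defaultdict(set)
    for citing, cited in citations:
        graph[citing].add(cited)
        graph[cited]  # touch the key so cited-only works appear as nodes
    return dict(graph)


def in_degree(graph):
    """Count how many works cite each node."""
    counts = {node: 0 for node in graph}
    for cited_works in graph.values():
        for cited in cited_works:
            counts[cited] += 1
    return counts


# Hypothetical pairs, standing in for citations extracted from PDFs.
pairs = [
    ("archaeology_report", "excavation_catalogue"),
    ("philology_article", "excavation_catalogue"),
    ("philology_article", "text_edition"),
]

network = build_citation_network(pairs)

# Works cited by more than one source hint at shared ground between fields.
shared = [work for work, n in in_degree(network).items() if n > 1]
print(shared)
```

In this toy example, a work cited by both an archaeological and a philological source is the kind of link between departments that the project aims to surface.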
Since you have been a multi-semester Discovery partner, could you describe your project in phases and what each semester has been building up to?
- Phase one: Defining the Workflow & Discovery
- Phase two: Building Python Notebooks
- Phase three: Testing & Fine-tuning
- Phase four: Documenting in JupyterBook & Website
- Phase five: Next Steps, Iteration and Expansion
Could you elaborate on certain aspects of this project that you find to be the most challenging?
The data we’re working with is multilingual, which means that few Python tools work out of the box. As we apply the tools and methods we’ve built for English text, we have to create the right conditions for these additional languages so that our models can parse the data correctly.
The project is currently between phases three and four, which means most of the coding is complete. Because these methods are very complex, we have found it advantageous to incorporate the individual notebooks into a single JupyterBook. By making this move to the JupyterBook, we hope that the project will be more easily applicable to other textual datasets and more accessible for teaching these tools and methods in future courses in Data Science and Digital Humanities.
How has your experience been with Discovery and your student researchers so far (through all the semesters you have been a discovery partner)?
In a word, excellent. The Discovery program is a tremendous opportunity for multi-level involvement, from students to faculty. This has also enabled multi-university collaboration with colleagues interested in participating in these exciting projects. I personally enjoy the less formal, more hands-on aspect of the meetings, which allow us to learn a lot from each other over the course of each semester. I’m a firm believer that there’s nothing you can’t accomplish with a dedicated team of researchers. In addition to advancing my own research in NLP, the Discovery program has been ideal for building new tools and resources for the curriculum I’m developing at Berkeley.
What are you looking forward to the most this semester?
I’m excited to aggregate all of our previous work and build workflows in Jupyter Notebooks. This would create a modular environment for future work and make it easy to hand off. This work would even complement the curriculum I teach and make it easier for students with little to no experience in NLP and Python to follow along.
Citation analysis is a multilingual problem, and the biggest bottleneck in NLP right now is that our current tools work best on English. Some languages are not even encoded yet, which makes it extremely difficult to accomplish NLP tasks. I’m very excited to build a universal and wide-reaching tool that could be used by anyone, anywhere.