Last year we worked on building a JupyterBook workflow (with Python Jupyter notebooks) that takes in a large collection of PDFs and OCRs the files in preparation for building language models. So far these models include topic modeling (LDA), word embedding (Word2Vec and Doc2Vec), and deep learning (BERT). This workflow is mostly complete, although we may look at ways of optimizing the deep learning component. A sketch of the OCR-to-topic-model step appears below.

In the coming year we hope to build better segmentation into the raw text, both in preparation for deep learning (with encoder/decoder values) and for parsing citations into author, title, date, and publication information. These will then serve as features for the node list, since the language models already build edge lists for network graphs (see our AWCA webpage for more details: http://digitalhumanities.berkeley.edu/ancient-world-computational-analysis-awca). Lastly, we will work on building a web app that uses D3.js for large network visualizations, allowing users to run these tools on any dataset they choose to upload.
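To make the first stage of the workflow concrete, here is a minimal sketch of OCRing a folder of PDFs and fitting an LDA topic model. It assumes pdf2image, pytesseract, and gensim are installed; the `pdfs/` directory, the tokenization choices, and the model parameters are illustrative, not the project's actual settings.

```python
# Sketch: OCR a folder of PDFs, then fit an LDA topic model.
# Assumes Tesseract is installed; paths and parameters are hypothetical.
from pathlib import Path

from pdf2image import convert_from_path   # renders PDF pages as PIL images
import pytesseract                        # Tesseract OCR bindings
from gensim import corpora, models
from gensim.utils import simple_preprocess


def ocr_pdf(pdf_path):
    """OCR every page of a PDF and return the concatenated text."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


# OCR the collection and tokenize each document into lowercase word lists.
texts = [simple_preprocess(ocr_pdf(p)) for p in Path("pdfs").glob("*.pdf")]

# Build a bag-of-words corpus and fit the topic model.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)

# Inspect the top topics.
for topic_id, words in lda.print_topics(num_topics=5):
    print(topic_id, words)
```

In a real corpus of this size, the OCR output would also need cleaning (stopword removal, hyphenation repair) before modeling; the sketch omits those steps for brevity.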
Term: Fall 2020
Topic: Humanities