Data Science Discovery Program Students Help Open Accounting Data from Ancient Mesopotamia

Niek Veldhuis

Adam Anderson

Cuneiform tablet with grain rations from the Ur III Period, c. 2100-2000 BC, Harvard Semitic Museum at Harvard University.

This 42 x 27 mm Ur III tablet, belonging to English collector David Johnson, lists commodities such as beer, flour and oil for travel, along with the name of Abbamu, an emissary, going to Susa. (https://ant.david-johnson.co.uk/catalogue/70)

This Babylonian tablet, measuring 57 x 41 mm with 21 lines of text on front and back, sold at a Christie’s auction in July 2019 for 18,750 British Pounds.

August 9, 2021

Undergrad DS students add a range of expertise to decoding clay tablet tabulations

In many instances, data science points to a brighter future where better machine learning tools and algorithms are used to solve problems ranging from climate change to better healthcare access, from improving traffic safety to finding the very best deals on Prime Day.

But some UC Berkeley researchers are applying such tools to learn more about the past, like deciphering business and personal transactions recorded on clay tablets in ancient Sumer. Such tablets are drawing international attention with the news that the United States was returning 17,000 tablets and other ancient relics to Iraq at the end of July.

Adam Anderson joined Berkeley in 2017 as a Mellon Postdoctoral Fellow in the Digital Humanities and has been a lecturer in digital humanities and data science. His Sumerian Network research project, in collaboration with Professor of Assyriology Niek Veldhuis, focused on developing the tools and workflow for analyzing the content of 15,000 ancient Sumerian clay tablets and adding the resulting data models and code to open data repositories. The overarching goal of the project has been to make their work fully reproducible so other researchers can contribute to the translations, and they have created the Sumerian Networks Jupyter Book to demonstrate this process.

For help, Anderson and Veldhuis turned to the Discovery program, a key part of Data Science Education in the Division of Computing, Data Science, and Society at Berkeley. The Data Science Discovery program matches interested students with researchers working on campus, in nonprofits and government agencies, and in industry. Over four years, 15 data science students contributed to the project, some for up to three years, others for a semester or two. To find the students, Anderson listed the types of tools the students would be using -- natural language processing, Python, machine learning -- and the Discovery program staff did the matching.

“We were looking for a lot of different types of students to work on different aspects of the projects. You’d be surprised at how many students are craving something outside of their STEM fields,” Anderson said. “The problems are complex and multidimensional and some students see the work as better preparation for jobs as they coordinate and combine processes to achieve the goals.”

A paper describing the project will soon be published by Interdisciplinary Digital Engagement in Arts & Humanities, a peer-reviewed, online, open access journal. The paper is co-authored by Anderson, Veldhuis and Anya Kulikov, one of the Discovery program students.

Kulikov, who graduated from Berkeley in May 2020 with a degree in data science and linguistics, worked on the project for two years and on the paper through December 2020. Intrigued by the work combining humanities and technology, she contacted Anderson directly and was added to the team. Her task was to identify the words in the text that referred to commodities and develop a tool for counting them. The job was made easier by the fact that commodities were always followed by a number in the texts, whereas other terms weren’t.

“By figuring out the patterns based on placement in the sentences and being careful and diligent, the commodities were pretty easy to find,” said Kulikov, who now works as a software engineer for Goldman Sachs. “Working in a different language from the ancient world was very cool. I was surprised to learn about their advanced accounting system with different units for liquids and solids, much like our system today.”
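The pattern Kulikov describes, a commodity term followed by a number, can be illustrated with a small Python sketch. This is not the project's actual tool: the function name `count_commodities` and the sample line are invented for illustration, and the tokens are English placeholders rather than real transliterated Sumerian.

```python
from collections import Counter


def count_commodities(tokens):
    """Sum the quantities of every non-numeric token directly followed by a number."""
    counts = Counter()
    # Walk over adjacent pairs: (token, next token).
    for word, following in zip(tokens, tokens[1:]):
        if not word.isdigit() and following.isdigit():
            counts[word] += int(following)
    return counts


# Hypothetical line of text using English stand-ins for the original terms.
sample_line = "Abbamu emissary beer 5 flour 3 oil 1 beer 2".split()
totals = count_commodities(sample_line)

for commodity, quantity in sorted(totals.items()):
    print(commodity, quantity)
```

Names and titles such as "Abbamu" and "emissary" are skipped because no number follows them, which is the property the article says made commodities easy to find.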

“For me,” Veldhuis said, “one of the very attractive sides of the Discovery program is to learn from students. In my role as a professor, I am often expected to know everything and to teach others.

“But here I get students who have a thorough understanding of coding, and have a much broader view of the possibilities and the pitfalls than I have,” Veldhuis said. “I teach them something about the Sumerian culture, they teach me about data science. And that has proven to be extremely fruitful.”

The Discovery program began as a grassroots effort to give students experiential training to prepare them for either a job or graduate school, said Anthony Suen, Director of Co-Curricular Programs for Data Science Undergraduate Studies at Berkeley.

“Participating in the Discovery program gives students a more complete perspective,” Suen said. “They also encounter hiccups, don’t succeed on their first try and then get back to work.”

Transferring data from tablets

The core of the project is the data found on 15,000 clay tablets from the Third Dynasty of Ur, which ruled an area straddling the Tigris and Euphrates rivers that is now part of Iraq. The area includes the Fertile Crescent and a number of city-states, including Babylon. Also known as Ur III, the dynasty ruled during the 21st century BC. Data from the tablets are curated online in three databases: the Open Richly Annotated Cuneiform Corpus, the Database of Neo-Sumerian Texts and the Cuneiform Digital Library Initiative.

The tablets form an accounting system for commodities such as sheep, goats, oxen, and even wild animals that were traded, along with related products such as wool, leather, shoes, as well as grains, beer, metals, ores, treasure, and tributes to rulers. The tablets also include the names of people involved in commerce. Used by six city-states in the area, these tablets comprise what Anderson calls “the stock market of the ancient world, listing the comings and goings of these commodities.”

The Sumerian Network project is aimed at building reproducible socio-economic networks from the data, then refining these models to more accurately reflect the actors and entities active during the Ur III period (c. 2100-2000 B.C.). The 15,000 texts are from the site of Drehem (known in antiquity as Puzriš-Dagān) and began to appear on the antiquities market around 1910 after the site was looted. They are found today in museums in Iraq, Canada, Europe, Japan, and the United States, as well as in private collections.

Many were looted in the aftermath of the Gulf War, and ISIS is believed to have sold some to raise money, such as a collection that was sold to a private collector for $1.6 million and intercepted in the United States several years ago. One of the largest private collections in the U.S. comprises 14,000 Mesopotamian relics and is being returned to Iraq as part of a recent agreement between the U.S. and Iraq.

“This is the biggest data repository in the oldest civilization I could find. Sometimes they come up for auction and we grab the image -- we try to get the data by hook or by crook,” said Anderson, whose studies of ancient languages have taken him from Harvard to Munich and Copenhagen to Jerusalem. “There are a lot of telltale clues in these tablets -- names, dates, where the transactions occurred. We want to figure out where the money is going, which city-state is in power, and the relationship between the people and objects.”

Anderson believes there are enough such clues to identify the people involved, even though many names are common, adding “these are real people doing real things.” This disambiguation process is the next phase of the project. But, he admits, this is just the tip of the iceberg. An estimated 120,000 tablets from Ur III are currently accounted for, but experts believe many more tablets have not been excavated.

Anderson began his linguistics studies focusing on Semitic languages, such as Hebrew, but then discovered vestiges of earlier languages, which were precursors to the newer ones.

“The Sumerians invented this writing system and there are hundreds of thousands of documents, many of them not translated,” he said. “That was really exciting for me -- this is the great-grandfather of Biblical language. But the system is so complex, we have to use computational methods to get the full sense of it.”

The project has brought together archaeologists, cuneiform specialists and experts in computational text analysis and natural language processing from around the world and spans data science, the social sciences and digital humanities. It’s also drawing together the disciplines of philology, which studies the development of languages, and archaeology, which often uncovers evidence of languages, such as tablets.

Building in reproducibility

To ensure that the research results are reproducible, Jupyter notebooks were developed to describe the various natural language processing tools and methods used in connection with the code and dataset, which led to a series of empirical network models. The collection of notebooks clearly showing what was done step by step is now being combined into a Jupyter book. The idea is that other researchers can use the same methods to expand the database of translated tablets.

According to Anderson, the biggest challenge is developing tools to read the glyphs on the tablets -- existing character recognition tools were mainly created to read Latin script, such as written English.

In addition to helping develop the tools, Discovery students created tutorials explaining how to use them. The students have also gained experience in presenting posters about their work at campus events. The students receive academic credit, and Anderson and Veldhuis also ensure they are credited as co-authors of research papers and of the Jupyter notebooks available on GitHub.

“Faculty engagement with the students is really critical to the success of the Discovery projects,” Anderson said. “The students love it and keep working on the project. I can tell because some of them are meeting with me on Saturdays.”

Colman Bouton, who graduated in May 2021 with a degree in applied mathematics and a focus on biology, said he had taken a class from Veldhuis and was looking for a project that he could work on, but in a field he wouldn’t otherwise tackle.

His first job was to create a tutorial to present during a GraphXD online conference hosted by the Berkeley Institute for Data Science (BIDS). GraphXD is a cross-domain initiative that promotes interdisciplinary collaboration and training for researchers, scientists, and theorists using graphs and network analysis for applications in a variety of fields. He learned to use NetworkX, a Python library for studying graphs and networks.

“It went pretty well,” Bouton said. “I hadn’t used NetworkX before, but once I figured it out it was really neat to see the visualizations of the commodity transactions between individuals.” The plots indicated more transactions between more people at the center and smaller nodes farther out.

“We saw the same people together, with more close communication,” Bouton said. “Their society was complex, but from the data, we saw that the bigwigs didn’t interact with each other, not even the bigwigs close to them.”
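A transaction network of the kind Bouton describes can be sketched in a few lines. The example below is a plain-Python stand-in rather than NetworkX itself, and it is not the project's code: the people "A" through "D", the records, and the helper `build_network` are all hypothetical. It shows how well-connected individuals end up at the center of such a graph.

```python
from collections import Counter, defaultdict

# Hypothetical (giver, receiver, commodity) records, invented for illustration.
transactions = [
    ("A", "B", "sheep"),
    ("A", "C", "goats"),
    ("A", "D", "wool"),
    ("B", "C", "beer"),
    ("A", "B", "oxen"),
]


def build_network(records):
    """Return edge weights (transactions per pair) and each person's set of partners."""
    edges = Counter()
    neighbors = defaultdict(set)
    for giver, receiver, _commodity in records:
        # frozenset makes the pair undirected: (A, B) and (B, A) share one edge.
        edges[frozenset((giver, receiver))] += 1
        neighbors[giver].add(receiver)
        neighbors[receiver].add(giver)
    return edges, neighbors


edges, neighbors = build_network(transactions)

# Degree = number of distinct trading partners; the highest-degree node is the hub.
degree = {person: len(partners) for person, partners in neighbors.items()}
hub = max(degree, key=degree.get)
print(hub, degree[hub])
```

In a graph library, the same records would become edges of a graph object, and a layout algorithm would tend to place high-degree nodes such as the hub near the center of the plot.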

Longer-term, Bouton worked on lemmatization of words in the data, removing inflections and tenses of words to get to the core meaning, then associating it with the English word. He said this made it easier to do operations on the text and see what was happening.
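As a rough illustration of lemmatization, the sketch below maps inflected forms to a base form and an English gloss through a lookup table. It assumes a tiny hand-made lexicon with simplified entries and a hypothetical `lemmatize` helper. The project's real lemmatization pipeline is not described in this article and is certainly more involved.

```python
# Simplified, illustrative lexicon: inflected form -> (lemma, English gloss).
LEXICON = {
    "lugal": ("lugal", "king"),
    "lugal-e": ("lugal", "king"),
    "lugal-ra": ("lugal", "king"),
    "udu": ("udu", "sheep"),
}


def lemmatize(tokens):
    """Replace each token with (lemma, gloss); unknown tokens keep their form with no gloss."""
    return [LEXICON.get(token, (token, None)) for token in tokens]


result = lemmatize(["lugal-e", "udu", "unknown-form"])
for lemma, gloss in result:
    print(lemma, gloss)
```

Once different written forms collapse to one lemma, counting or grouping by that lemma treats them as the same word, which is what makes later operations on the text simpler.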

“We’re taking a shot at defining the structure of their economy and it’s interesting to see the economy change over time,” Bouton said. “We would go back and see if the results made sense, sort the commodities over time. We’re also trying to figure out if the big dips in trade were bad years due to flooding.”
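Looking for dips in trade over time can be sketched as totaling quantities per year and flagging years that fall well below the average. Everything here is an assumption made for illustration: the records, the year indices, the helper names, and the 50 percent threshold are invented and do not come from the tablets or the project.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year index, commodity, quantity) records, invented for illustration.
records = [
    (1, "sheep", 120), (1, "wool", 40),
    (2, "sheep", 110), (2, "wool", 50),
    (3, "sheep", 30), (3, "wool", 10),
    (4, "sheep", 130), (4, "wool", 45),
]


def yearly_totals(rows):
    """Sum all quantities per year and return them in chronological order."""
    totals = defaultdict(int)
    for year, _commodity, quantity in rows:
        totals[year] += quantity
    return dict(sorted(totals.items()))


def find_dips(totals, threshold=0.5):
    """Return the years whose total is below `threshold` times the mean yearly total."""
    average = mean(totals.values())
    return [year for year, total in totals.items() if total < threshold * average]


totals = yearly_totals(records)
dips = find_dips(totals)
print(totals, dips)
```

A flagged year is only a candidate. Deciding whether it reflects flooding, missing tablets, or something else is the kind of check against the historical record that Bouton describes.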

Bouton said he was surprised that all of these exchanges were recorded and how thoroughly the commodities were tracked, much like today where the focus on numbers is such a part of our culture. “The Sumerian roots appear to be an important part of the groundwork of civilization,” Bouton said. “Maybe that was needed to have such large functional cities.”

One appealing aspect of a project like this is that students are contributing to a larger ongoing effort, rather than doing something that starts and ends in the course of a semester. Although the pandemic shutdown has made his interactions less personal, Anderson said he still has conversations with students beyond their progress on the tasks at hand. He talks about their career paths and they ask for letters of recommendation.

“They are getting real work done and have some real accomplishments they can point to -- it’s not just theoretical,” Anderson said. “They help build things, update the work of others and help keep the overall project spinning along. As is often the case in the humanities, you’re never really done with a project -- this is one of the world’s greatest puzzles and I could work on it for the rest of my life.”

In talking about what has been learned to date, Anderson repeatedly talks about the people, not the dynasty and the collective city-states.

“It’s easy to have a dated point of view of antiquity, but we are seeing that it is incredibly personal and individualized,” he said. The Bronze Age cuneiform archives reveal that “they had credit accounts, loans, marriages and divorces, and even poor families selling children into slavery. We can see the complexity of life, the treaties, legal system, and the business that was going on throughout this time.”