Berkeley Data Science Students Help Uncover How Flaubert Honed “Madame Bovary”

Ramona Naddaff is a professor of rhetoric at UC Berkeley.

October 5, 2021

When Gustave Flaubert published his groundbreaking “Madame Bovary” as a serialized novel in “Revue de Paris” in 1856, the story stirred up more than outrage and charges of obscenity against the author and publisher. It introduced a more realistic, less romantic approach to Western fiction and is now known as the first modernist novel. 

Published as a two-volume book in 1857, “Madame Bovary” has been analyzed and parsed ever since, the focus of at least 20 translations into English alone and more than a dozen films.

Now, Ramona Naddaff, a professor of rhetoric at UC Berkeley, is digging deeper into the author’s influences, self-censorship, editorial censorship and editing methods using data science tools as part of the campus’ Digital Humanities program. Over the past year, students in Berkeley’s Data Science Undergraduate Studies Discovery program have applied their talents to helping Naddaff unravel these secrets by developing tools to analyze and compare Flaubert’s seven handwritten drafts and the final draft as marked up by his publisher.

Flaubert’s style has been described as a “muscular prose” in which he carefully chose every word line by line, only to reject many of them and start anew in subsequent drafts.

“It’s writing by suppression, where he is deleting, deleting, deleting,” Naddaff said. “He rewrote and revised his prose time and time again up until the novel’s final edition.”

One modern publisher of an exact replica of the book notes that, from June 1851 to April 1856, Flaubert wrote at least 4,456 pages in creating his novel, which covered 470 pages in the final version.

By carefully analyzing and comparing all of the drafts, Naddaff seeks to learn more about Flaubert’s edits and revisions. Was he amending the text to avoid further charges of obscenity? Or just paring and polishing it for publication? Was he bringing the influences of other writers to bear on his work as he reworked his text over and over again?

Naddaff’s findings will be featured in a book she’s writing, tentatively titled “A Writer’s Trials: On the Writing, Editing and Censorship of Madame Bovary.” She explored similar ideas in her earlier book, “Exiling the Poets: The Production of Censorship in Plato’s Republic.”

Plugging into data science

With assistance from Adam Anderson in Berkeley’s Digital Humanities, Naddaff was able to enlist the help of four data science students through the Discovery program: Grace Chen, Prachi Deo, Christina Lov and Ellen Persson.

“Through Digital Humanities, I’ve been able to get a database I can work from as I do my analysis,” Naddaff said. “I can use the data to document what was done, whether through self-editing or by others -- with data to back up my findings.”

Naddaff made the digital connection through a call to the D-Lab help desk. Then a postdoc, Anderson took the call and thought the project was a perfect fit for the Data Science Undergraduate Studies Discovery program, through which students can work on projects outside their major.

“I knew I’d get some good students, and they all just dove in and had a blast,” said Anderson, who was a research training program manager in the Berkeley Institute for Data Science (BIDS) at the time. “And the experience is valuable -- they could get jobs with Amazon helping to train Alexa.”

Data Science Undergraduate Studies and BIDS are part of Berkeley’s Division of Computing, Data Science, and Society (CDSS). One of the goals of CDSS is to expand the interactions between data science and many disciplinary research domains, including the humanities.

The project used an extensive, complex database originally compiled by the Centre Flaubert, a research consortium at the University of Rouen. Each page of the drafts had been digitized showing the original text, cross-outs, additions, marginalia and other edits.

“This is pretty complex because there are different layers of text,” said Anderson, who joined Berkeley in 2017 as a Mellon Postdoctoral Fellow in the Digital Humanities and has been a lecturer in digital humanities and data science. “We had to scrape the data smartly and split out the text for each folio, or page, or else we would have been left with a lot of gobbledygook.”

To allow the type of analysis Naddaff was seeking, the data science team used natural language processing techniques to split the text into individual words and identify every unique word, a process known as tokenization. This allowed the team to count how often each word appears in the text and to check each word against a lexicon to determine its part of speech (noun, verb or adjective) and its gender.

“Once you have that, you can contrast the types of verbs used from one version to the next,” Anderson said. “You can get mathematical with the text.” 
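A minimal Python sketch of the steps described above might look like the following. It is not the students’ code: the tiny lexicon, the two sample sentences and the function names are all invented for illustration, and a real analysis would rely on a full French lexicon and the digitized drafts.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: word form -> (part of speech, gender).
# These entries are illustrative only, not taken from the project.
TOY_LEXICON = {
    "femme": ("noun", "f"),
    "jardin": ("noun", "m"),
    "regardait": ("verb", None),
    "marchait": ("verb", None),
    "belle": ("adjective", "f"),
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Count how many times each token appears."""
    return Counter(tokenize(text))


def pos_counts(text, lexicon=TOY_LEXICON):
    """Count tokens per part of speech, skipping words missing from the lexicon."""
    counts = Counter()
    for token in tokenize(text):
        if token in lexicon:
            counts[lexicon[token][0]] += 1
    return counts


# Invented sentences standing in for two drafts of the same passage.
draft_a = "La belle femme regardait le jardin et marchait."
draft_b = "La femme regardait le jardin."

freq_a = word_frequencies(draft_a)
pos_a = pos_counts(draft_a)
pos_b = pos_counts(draft_b)

# Change in each part-of-speech count from draft A to draft B.
diff = {pos: pos_b[pos] - pos_a[pos] for pos in set(pos_a) | set(pos_b)}
print(diff)
```

In this toy case, the second draft drops one verb and one adjective while keeping both nouns, which is the kind of version-to-version contrast Anderson describes.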

After that, a method known as n-grams can be applied. It looks for a specific word and identifies the words immediately before and after it every time it appears, or the two words preceding and following it, and so on. In essence, this is what neural networks do, Anderson said.

“In this way you really get to see the contextual use of the words when they are cross-referenced,” he said. This then forms the basis of a language model that further helps the analysis. For example, Naddaff could easily identify which clusters of words are commonly used together. 
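As a rough illustration of the n-gram idea, the Python sketch below counts adjacent word pairs and collects the words on either side of a chosen word. The sample text and the helper names `ngrams` and `context_windows` are assumptions made for this example, not part of the project’s notebooks.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_windows(tokens, target, width=1):
    """For each occurrence of target, collect up to `width` tokens on each side."""
    windows = []
    for i, token in enumerate(tokens):
        if token == target:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            windows.append((before, after))
    return windows


# Invented text; note that this simple version ignores sentence boundaries.
text = "Emma regardait le jardin. Emma marchait dans le jardin."
tokens = tokenize(text)

bigram_counts = Counter(ngrams(tokens, 2))
emma_contexts = context_windows(tokens, "emma", width=1)

print(bigram_counts.most_common(1))
print(emma_contexts)
```

Setting `width=2` would return the two words preceding and following each occurrence, and the bigram counts give a simple view of which words tend to appear together.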

As the four Discovery students finished their tools, they created well-documented Jupyter Notebooks for using them, Anderson said, as well as videos showing step by step how to use those notebooks. Not only will these be valuable to Naddaff, but they will also allow others to replicate and verify her results.

“The notebooks are easy for someone else to pick up and use, and they are reproducible and extendable,” Anderson said. “The tools turn what seems like an impossible task into something you can wrap your head around.”

Ellen Persson, one of the students, had taken French in middle school in Orinda, and that sparked her interest in the project. She had also done more traditional analyses of books such as “Beowulf” in the original Old English as well as “Le Morte d'Arthur,” the first major work of prose fiction in English, completed in 1469 or 1470.

“I was interested in the Madame Bovary project because it was different from what we had done before,” said Persson, who is studying engineering mathematics and statistics. “Instead of trying to understand the characters or the plot, we were viewing Flaubert’s editing process.”

In Data 100, Persson learned about logistic regression, a method for predicting a binary outcome: whether a value is 1 or 0, a message is spam or not spam, an answer is true or false. Building on that, she developed a way to predict whether a line would be struck in each version. For her, the key indicator was whether the line contained the name of a specific character who was prominent in the first draft, disappeared in the second, then returned with a lower profile in the following versions. If the character’s name appeared in a line, it was 50 percent more likely to be struck.
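The general approach can be sketched in a few lines of Python. This is a simplified stand-in, not Persson’s model: it fits a one-feature logistic regression by plain gradient descent, and the ten data points, the learning rate and the function names are invented for illustration and do not reproduce the project’s results.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weight w and bias b of a one-feature logistic regression
    by batch gradient descent on the average log loss."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Invented toy data, NOT figures from the project.
# has_name: 1 if the line contains the character's name, else 0.
# struck:   1 if the line was crossed out, else 0.
has_name = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
struck = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

w, b = fit_logistic(has_name, struck)

# Predicted probability that a line is struck, with and without the name.
p_with = sigmoid(w + b)
p_without = sigmoid(b)
print(round(p_with, 2), round(p_without, 2))
```

With a single yes-or-no feature, the fitted probabilities simply approach the share of struck lines in each group. A fuller model would add more features per line, which is where logistic regression becomes more useful than counting.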

“What I liked most was applying the skills I had, but it still challenged me,” Persson said. “I felt like I had the skills needed to get started and I could develop further skills. I definitely learned more about coding in Python.”

Discovery student Ellen Persson used what she learned in Data 100 to create these graphs to show the difference in the frequency of offsets for the characters in different drafts of "Madame Bovary."

Digging into Flaubert’s style

In addition to better understanding how Flaubert edited or censored himself, Naddaff also plans to look for evidence of him being influenced by other 18th- and 19th-century writers. How did his careful reading and study of other writers influence the editing of his work, Naddaff wonders. “Flaubert was a writer under the influence of poets and novelists he both admired and detested, not just the solitary hermit genius author many believe him to be,” Naddaff said.

“One of his lovers was Louise Colet, a minor poet, and he edited her poetry among other things,” Naddaff said. “Flaubert would send his carefully copyedited pages to Colet and she would respond, accepting and rejecting the changes he made to her poems, deciding which of his editorial remarks captured her own sense of the poems’ form and content. The way he edited her was but one of the ways he learned to edit himself and to develop his own style.”

Even after more than 150 years, researchers disagree on Flaubert’s treatment of the romantic and realist traditions.

“He’s writing ironic parodies of them, of course, but he is also trying to assimilate them into a new style, a hybrid form, poetic prose and prosaic poetry,” Naddaff said. “There was a whole invisible network composing the novel with him.” 

In the end, Naddaff plans to make use of data “in ways I haven’t seen done a lot. This data will make the results strong and detailed,” she said. “Adam really thought a lot about how to do this -- he brainstormed, informed and structured the project.”

Anderson said using data-driven analysis could dramatically change how research is done in the humanities. 

“Often the methods are subjective and not clearly spelled out, with a researcher saying ‘here are my results based on my 30 years of experience,’” Anderson said. “By bringing these tools into the humanities, we can produce results that are fully contextualized and indexed. I’m really glad Professor Naddaff thought of this and asked for our help.”