Last day of class. Final test over. Class dismissed? Not so fast for the students in UC Berkeley’s Data 88E course, who remained in their seats on Nov. 30 for a presentation by David Card, a labor economist and Berkeley professor who recently won the 2021 Nobel Memorial Prize in Economic Sciences for work that challenged orthodoxy and dramatically shifted understanding of inequality and the social and economic forces that impact low-wage workers.
The course, which attracted students majoring in economics, data science and even philosophy, teaches students how to apply data science concepts and Python programming within the discipline of economics.
Card, whose research interests include immigration, wages, education and gender- and race-related differences in the labor market, gave students his views on the critical importance of building accurate, representative datasets for analysis and on the value of descriptive and causal modeling.
Building a better dataset
“The ‘build’ is everything,” Card told the students. “A project comes to life or dies based on the build.”
Card gave an example of studying what happens to quarterly earnings when employees move between jobs. The starting dataset could include 200, 300 or even 400 million people or more. In practice, it needs to be pared down to 10,000 or so cases, and the data scientist doing the work must be able to explain both what happened to the millions of unused data points and how the remaining set is representative of the overall structure.
It’s not easy, Card said, and often the work needs to be done over and over to ensure that the user understands how the dataset was created and that problem entries, such as duplicates and missing values, are accounted for. These problems are endemic to economic datasets, Card said, assuring the students that “everything is all messed up.”
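One way to keep track of what happened to the unused data points is to log how many rows each cleaning step removes. The following is a minimal sketch in Python with pandas, not Card's own procedure. The toy records, the column names (`worker_id`, `quarter`, `earnings`) and the `build_sample` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical toy records; the column names are illustrative only.
raw = pd.DataFrame({
    "worker_id": [1, 1, 2, 3, 4, 5],
    "quarter": ["2020Q1"] * 6,
    "earnings": [9000.0, 9000.0, 12000.0, None, 7000.0, 15000.0],
})


def build_sample(df):
    """Apply each filter in turn and record how many rows it removes."""
    steps = [
        ("drop exact duplicates", lambda d: d.drop_duplicates()),
        ("drop missing earnings", lambda d: d.dropna(subset=["earnings"])),
    ]
    log = [("raw", len(df), 0)]
    for name, step in steps:
        before = len(df)
        df = step(df)
        log.append((name, len(df), before - len(df)))
    columns = ["step", "rows_kept", "rows_dropped"]
    return df, pd.DataFrame(log, columns=columns)


sample, attrition = build_sample(raw)
print(attrition)
```

The `attrition` table gives one line per step, so every row that left the sample can be attributed to a named rule.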
But that’s just the start. Most data analysis in economics includes merged datasets, and merging poses another set of problems. “As a practical matter, I’ve been doing research in economics for 40 years and can say almost all merges are screwed up,” Card said.
The way to approach merging multiple datasets is step by step, beginning with sets A and B. Start with a subset and merge it with another, check that the result is accurate, then “freeze” it and repeat the process with the next set. It’s a long and difficult process, he said, but essential for yielding defensible results.
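The merge-check-freeze routine could look something like the sketch below, which uses pandas and is only one possible reading of Card's advice. The two tables and their columns are hypothetical. The `validate` and `indicator` arguments of `DataFrame.merge` are real pandas options, used here to surface duplicate keys and unmatched rows.

```python
import pandas as pd

# Hypothetical sets A and B, keyed on worker_id.
workers = pd.DataFrame({
    "worker_id": [1, 2, 3, 4],
    "city": ["SF", "SF", "Austin", "Austin"],
})
earnings = pd.DataFrame({
    "worker_id": [1, 2, 3, 5],
    "earnings": [9000, 12000, 7000, 15000],
})

merged = workers.merge(
    earnings,
    on="worker_id",
    how="outer",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds a _merge column: left_only, right_only, both
)

# Check: how many rows matched, and how many exist in only one table?
counts = merged["_merge"].value_counts()
print(counts)

# "Freeze": keep the verified matches as a fixed input for the next merge.
# In practice this might be written to disk; a copy stands in for that here.
frozen = merged[merged["_merge"] == "both"].drop(columns="_merge").copy()
```

Unmatched rows (`left_only` and `right_only`) are worth inspecting before moving on, since they often point to key formatting problems or gaps in coverage.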
Card told the class that there is huge value in studying observational differences, such as comparing employed men and women, comparing the occupations of whites and non-whites, or comparing people who move between two cities, such as San Franciscans moving to Austin and Austinites moving to San Francisco. Many such patterns are predictable and form the basis of labor economics. But, again, you need to be wary. If you think something is one way but the data seem to indicate something else, “nine times out of 10 the dataset is screwed up,” Card said.
Long used in business, research design based on causal modeling studies “what causes what to move around” and is key to microeconometric research, which looks at the economic behavior of individuals, households, organizations or companies. A classic example is when businesses test marketing ideas through A/B tests, giving different consumers different options for buying the same item.
In estimating the effect of x on y, if either the size or the sign (plus or minus) of an estimated effect doesn’t make sense, the resulting level of significance is irrelevant, Card said. Yet many newcomers to econometrics focus on the significance level of a parameter without considering whether the estimate was reasonably obtained. Although there are some efforts to replace this approach with machine learning algorithms, Card asserted that results so far are not encouraging.
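The habit of checking sign and size before significance can be illustrated with a small sketch. The data below are invented, the example of schooling and log wages is an assumption chosen for illustration, and the plausible range passed to `is_plausible` is a hypothetical bound that an analyst would have to justify from prior knowledge.

```python
# Hypothetical data: years of schooling (x) and log hourly wage (y).
years_school = [10, 12, 12, 14, 16, 16, 18]
log_wage = [2.40, 2.62, 2.58, 2.81, 3.02, 2.98, 3.21]


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit: cov(x, y) / var(x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def is_plausible(slope, low, high):
    """True if the estimate has the expected sign and a sensible size."""
    return low < slope < high


slope = ols_slope(years_school, log_wage)
# Assumed bounds: a positive effect, below 0.3 log points per year.
plausible = is_plausible(slope, low=0.0, high=0.3)
print(f"slope = {slope:.4f}, plausible = {plausible}")
```

Only if the estimate passes a check like this does it make sense to go on and look at its standard error or significance level.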
Getting started in research
In response to a question from Data 88E instructor Eric Van Dusen about finding an honors thesis project, Card acknowledged that with 700 seniors each year vying for the attention of one of 40 faculty, it’s difficult. But he offered some suggestions for students to distinguish themselves. If possible, ask an upper-year student who has a project to introduce you to a faculty member.
Also, look for areas where not a lot of research has been done. Card said when he started studying the effects of minimum wages on regular wages, no one had ever studied the topic. When Van Dusen asked about integrating data science into the economics curriculum, Card agreed that data science skills are playing an ever-greater role in economics research.