As Project Jupyter Celebrates 20 Years, Fernando Pérez Reflects On How It Started, Open Science’s Impact and the Value of Diversity in Coding

Fernando Pérez

Fernando Pérez is an associate professor in statistics at UC Berkeley, a faculty scientist in Lawrence Berkeley National Laboratory’s Department of Data Science and Technology and a faculty affiliate at the Berkeley Institute for Data Science. (Photo/Jonathan Taylor)


August 19, 2021

Jon Bashor contributed to this article.

Twenty years ago, UC Berkeley Associate Statistics Professor Fernando Pérez started one of the foundational tools for analyzing large amounts of data in a transparent and collaborative way. That project, IPython, evolved into Project Jupyter.

Project Jupyter provides a collection of tools such as the Jupyter Notebook to assist users in the process of interactive computing -- iteratively executing small fragments of programming code to explore, analyze and visualize data and computational ideas. It also allows scientists to view and build upon the work of other researchers worldwide.
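The cell-by-cell exploration a notebook supports can be sketched in plain Python. This is an illustrative example, not Jupyter's own code; the data values are invented, and each "cell" below stands in for a notebook cell that a user would run and inspect before writing the next one:

```python
# A sketch of the iterative, cell-at-a-time workflow a Jupyter
# notebook supports. In a real notebook, each cell's output is
# inspected before the next cell is written. Data is invented.
import statistics

# Cell 1: load (here, fabricate) some measurements.
temps_c = [12.1, 13.4, 11.8, 14.2, 13.0, 12.7]

# Cell 2: inspect a quick summary before deciding what to do next.
mean_c = statistics.mean(temps_c)
spread = statistics.stdev(temps_c)

# Cell 3: based on what we saw, transform the data and re-examine.
temps_f = [c * 9 / 5 + 32 for c in temps_c]
mean_f = statistics.mean(temps_f)

print(round(mean_c, 2), round(spread, 2), round(mean_f, 2))
```

In a notebook, a plotting cell would typically follow each summary, and the code above would be revised in place as the picture of the data sharpened.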

Nearly 10 million Jupyter notebooks have been made public by users on GitHub, and the tool has been deemed one of 10 computer codes that transformed science, according to Nature.

Jupyter and similar tools have underpinned groundbreaking research like the first image of a black hole. And Jupyter has changed the process of scientific publishing, making it possible for scientists to easily share the data and code behind their conclusions and offering ways to replicate them.

We spoke with Pérez, who is also a co-founding investigator at the Berkeley Institute for Data Science and a faculty scientist at Lawrence Berkeley National Laboratory, about why he started this project, what challenges he’s faced and what to expect from him and Project Jupyter next.

Q: What was your goal when you created IPython in 2001?

A: It was two-fold. One part was a set of technical and epistemological goals. I wanted to support the kind of workflow that is typical in the sciences. We don't tend to program toward a well-defined, pre-specified objective. We use programming languages in an interactive discovery process.

I was a graduate student in physics, and I had started using Python to analyze the data for my Ph.D. thesis. I realized that it was possible to use Python in that interactive, exploratory manner, but it was limited. I thought maybe I could build a small tool that would make that exploratory process easier: running a bit of code, plotting and visualizing some data, opening a data file, writing more code based on what I'm seeing in the figure.

The reason I wanted to do it in Python is that I was using a proprietary tool for my work, but I wanted to do my scientific work with open tools. I think of science’s mission as opening the black box of nature. It's a bit nonsensical to do that with tools that we are not allowed to open and understand, such as proprietary software.

There was also a bit of an ethical consideration, which is that I wanted to be able to share my work with, for example, my mentors back in Colombia. That wasn't possible with proprietary tools because they're very expensive. I wanted to have tools that would allow me to share everything that I did with others.

Then there's a personal reason, honestly, that I was really struggling with my Ph.D. I had been fired by my first Ph.D. advisor, a very toxic personality in the department, and I was struggling to complete my dissertation. That worked out in the end, thanks to the support of another professor in the department, Anna Hasenfratz, who supervised me until graduation. In addition to her scientific mentorship, she had the patience to let me "productively procrastinate" by building IPython, something that I could somewhat justify as a tool for finishing that dissertation. I regained some much needed confidence, I got attracted to building something, and it turned out to be really important. 

The key message here is the impact of finding a community who values your work: for me, IPython was an opportunity to do something that others cared about. It wasn't particle physics. It wasn't particle theory. But it was something that other scientists immediately reacted to. 

When I posted the first release of IPython to scientific Python mailing lists, others immediately jumped on it. Other scientists, who came from other fields, said, “This is valuable, and we're interested in what you're doing.” That feedback loop was critical to putting me back on track, along with the support of other mentors who helped me out.

It’s important to emphasize that, while IPython may have started as a small personal project, today’s Jupyter is the creation of a collaborative community, starting with my colleague Brian Granger and including many talented folks who contribute their work to this common mission.

Q: Can you talk about why it's important to be able to share your work with people who aren't working with you on that project?

A: What we are seeing today in the acceptance of open source software, which falls under the umbrella of open science, is that we never know which other parts of the scientific world may have a common need with us. By building tools that can be shared openly and that are interoperable, we accelerate the cycle of discovery. From a purely pragmatic stance, we want better science faster, and we want better discoveries that have impact. 

I think there's also a really important element of access here, too. At Berkeley, we couldn't possibly be teaching at the scale that we teach our data science courses if we were having to manage the cost and logistics of installing proprietary software on the laptops for 1,500 students at the beginning of each semester. That's just unthinkable. But with open tools, we can deploy, organize, access and structure them -- cost-wise and technically -- in a way that fits our needs.

Q: What’s an example of using open science tools, including Jupyter Notebook, to build something that otherwise couldn’t exist that accelerates discovery?

A: One of the projects I'm most excited about is a community that has coalesced around a project called Pangeo, which was originally focused on big data related to geoscience and climate science. It was started by a physical oceanographer, Ryan Abernathey, and a team of collaborators. That's a space that is really important because climate change is the defining problem of our generation, with impacts across the whole world that we see acutely in California in the wildfires and drought that have caused so much hardship in recent years.

With the Pangeo stack, you can go to the gallery, click on one link and -- using any browser, without installing anything -- play with terabytes of climate data in ways that would have been unthinkable five or 10 years ago and that can inform scientists, policymakers and the public’s actions and decisions around climate change. Making it possible for them to access this data is one of the things that motivates me the most. It is the community that I'm putting all of my technical research effort into these days.

I am not claiming by any means that the climate crisis is either a purely technical or purely scientific problem, nor that our tools are going to fix that problem, but at least we can try to help.

Q: You've talked about facing pressure from others in academia to focus more on research earlier in your career, rather than the development of IPython. How did you decide to stay the course on IPython?

A: That's a hard question. Some of it was just stubbornness and lacking a sense of career because it did take a long time, and I don't want to romanticize that. There were periods that were quite difficult in that process. My therapist definitely helped a lot, and I would not have made it through without the support of my wife, who at the time was also a graduate student at University of Colorado, Boulder, and thus had her own work to manage. I think it's important to acknowledge those kinds of resources and support. 

Very importantly, though, there were people at UC Berkeley early on who supported me when I was still a postdoctoral scholar in Colorado doing more traditional applied mathematics research with Python tools. I wasn't invested enough in the purely applied mathematics community to make a career just out of that. People at UC Berkeley that I connected to because of the Python community offered me a team and, eventually, a job. 

At first, we connected through Jarrod Millman, one of the early advocates for Python in science who continues to lead projects in this space, around developing grant proposals for the National Institutes of Health to fund open source tools in Python for science. This was 2004-2005, when nobody believed in this. Connecting with that team at UC Berkeley was critical because it gave me a legitimate scientific path where there was space for this work. It was unusual, but Mark D’Esposito, who was the principal investigator for the Berkeley Brain Imaging Center, was willing to believe in it and to fund a position for it. The first grants didn't get funded. They later got funding, and, in 2007, Mark offered me a position at UC Berkeley. I was at first a research scientist at the Brain Imaging Center. After that, we were able to stitch together this strange path.

I think acknowledging the support of individuals at UC Berkeley, and the fact that the institution is willing to take risks and allow for people like me to exist, is important. I want to credit them because, right when the UC Berkeley offer came, I had another job offer that would have meant going into industry and writing software for finance. I'm sure my professional path would have been very different if I had accepted it.

Q: People might be surprised how much of the work on Jupyter and other open science tools is done by volunteers. Why is it important to pay people for this work?

A: For me, I think this has been a process of iterative discovery and understanding that these projects -- even if they look like software projects -- are also actually about people who need to think in the long term. We need to structure this work with a strategic vision. We need to do day-to-day maintenance. We need to engage with the community. We need to do outreach. Some of these tasks are not always the most fun thing to do on a Friday night, and so they're the kind of things that don't get done if you're relying on volunteers. Others may be fun, but they may be long and hard and require dedicated effort, and unless you can really spend your whole day for days on end working on that problem, you're not going to be able to solve it.

Also, if you only rely on volunteers, or on people whose jobs happen to allow them to do the one thing they like, you will exclude parts of the population that don't have those affordances. By building tools without a complete slice of the society you want to reach, you can't possibly have the impact that you want in that society. It's not just a technical mission. It’s an ethical mission of building things that have a positive impact in the world. That impact isn't going to be realized if we're building things only by a few, because when you build by a few, you build for a few. If we want to build things that really are for everyone, we need to build them with everyone.

We're also creating a huge amount of strategic risk by building the scaffolding of science on these projects that are maintained by a handful of people -- in some cases, literally a handful -- who are unsupported, are taking professional risks or a pay cut, or are giving up a more stable opportunity. It only takes, in some cases, one or two or three people to quit, and some of these projects will literally crumble under their own weight. We say, “Oh look, the discovery of gravitational waves, a black hole.” Then you start looking and all of that great work is resting on this inverted pyramid with two people at the bottom. It’s crazy.

With proprietary software, I have objections on ethical and epistemological grounds, but they are legitimate business models. You pay for the software. Through those sales, these companies hire engineers. They hire sales people. They hire tech support people. They hire documentation people. All those jobs need to be done in open science, too, and we need to pay for them. It's better if we pay for them as open infrastructure that society can benefit from.

Q: I see I’m at the end of our time, but I’d be remiss if I didn’t ask: what’s next for Jupyter?

A: On the Jupyter front -- I don't fully know! The good news with Jupyter is that really, in a meaningful way today, I am largely irrelevant in the project. I mean that in a good way. There are so many talented people pouring their efforts into the project. Often I find out about cool things in Jupyter on Twitter or on Discourse, our community forum, and I’m like, “Wow, that came out? That's incredible.” That’s wonderful, but it’s also a challenge because the project has become so large. 

This growth presents challenges in terms of managing the community. For the last while, we’ve been working through a very complex process of restructuring our governance to better serve this growing group that includes stakeholders ranging from individual volunteers around the globe to teams at the largest tech organizations. It’s been a fascinating effort that I hope will set up the project for at least another 20 years of impact.

On the technical front, I am very excited about how the infrastructure we now have between JupyterHub, JupyterLab and JupyterBook gives us the foundation for collaborative, open science at scale. It takes advantage of the cloud in the whole life cycle of research, from individual exploration to large-scale analysis to publication and teaching. That’s been our vision for years, but the tools are now mature enough to use them to build platforms like Pangeo, so people can collaborate openly on large and complex problems that require diverse expertise.

Q: How about for you?

A: For me, the Pangeo effort has been an important inspiration both on computing and regarding my scientific interests. We at Berkeley joined forces with collaborators from Pangeo and the University of British Columbia (UBC) to create a new nonprofit organization called 2i2c, or The International Interactive Computing Collaboration. The point of that organization was to basically try to scale up beyond our universities -- beyond UC Berkeley, beyond Columbia University, beyond UBC -- the kinds of infrastructure that we were seeing we could build. 

At UC Berkeley, we had all of our efforts in data science education and Jupyter-related development, with leadership from Cathryn Carson, Chris Holdgraf, Yuvi Panda, Lindsey Heagy and others. At Columbia, Ryan Abernathey was building Pangeo. At UBC, Jim Colliander and others were building a project called syzygy for offering Jupyter infrastructure to Canadian researchers and a project called Callisto for K-12 education. All of us were finding ourselves in a similar setting where we said, “It is great that we can build this, but a university isn't quite the way to scale this up.”

So what we tried to do was create an organization that will push both the deployment of this infrastructure for social good -- for research and education -- and the sustainability of those projects and tools. Our vision is to build these tools and platforms for open science-at-large in society. By that I don't mean a platform like Facebook or Twitter, the monolithic thing that everybody has to go to. I mean platforms that can be assembled and re-used by others in their settings to tackle their own problems, with independence from any given vendor, and with the privacy and autonomy that their communities require. 

And for me, there are now three related directions where the above come together. First, with the diverse communities in Jupyter, Pangeo and 2i2c, I hope we will build in the next few years the “operating system for open science in the cloud”: vendor-agnostic tools, which will be easy to use and highly capable technically, to advance science and education for all without barriers.

Second, I personally want to focus my own efforts in using these tools towards the climate crisis. I am personally a lover of the mountains and all things cold, and we are losing the ice of our planet with devastating consequences. As a physicist, I find that space is a fascinating mix of physical principles and statistical challenges, and I am collaborating with cryosphere scientists and statisticians on these problems. It is important to have a concrete use case in order to develop genuinely useful tools; for me, climate and the environment are now the motivator to drive my development ideas for Jupyter.  

And third, finally, all of this work we’ve talked about is driven by software, but software that is deeply embedded in the specifics of scientific research. That’s how it all started for me and for other scientists, too. We sometimes hear this described as “software eating the science world.” If we accept that premise, then I think our agenda should be one where we: a) take software more seriously as a core element of science, and therefore teach our students how to build it in ways that complement the work of our computer science colleagues; b) develop research efforts to truly explore what is unique about the intersection of software and science; and c) reward the careers of those who do this kind of work at all stages -- from students to engineers, researchers and faculty -- and therefore fund their work with a comprehensive vision that goes beyond “publish a paper and throw away the code.”

Those are the spaces where I hope to build at Berkeley and with our colleagues beyond campus in the coming years: better computational tools for better science that is open to all and that has real-world impact in our most critical problems, starting with the climate crisis.

Q: Thanks so much for talking with me. Anything you’d like to emphasize or add?

A: I have to say on this last point -- designing and developing platforms with this vision -- that this is something that I know we are actively working on at CDSS, all the way up to Jennifer Chayes. She has been very supportive of this vision. I'm very excited that we can build on the role of software in science and the role of these open platforms in science. 

My hope is that these successes of the last few years are only the beginning, and that in five or 10 years, we'll be able to say that we actually did build something much better, much bigger and more sustainable that doesn't require the kind of crazy contortions that people like me -- and I'm not the only one -- have had to do for the last 10 or 15 years.