June 1, 2017

Jupyter Notebooks powering Berkeley’s data science curriculum

Jupyter being used in UC Berkeley’s Foundations of Data Science course

To catch a glimpse of the college course of the future, take a look at UC Berkeley’s Foundations of Data Science (Data 8), the pioneering course that introduces statistics, computer science and data analysis to lower-division students of all majors. 

During lecture, students can write and edit code on their laptops in real time, working on the same data set and code that the professor is discussing in class. Learning in the class is active and interactive: students can try their own calculations on large data sets at their fingertips and follow up with questions to the professor.

Underlying these classroom innovations are Jupyter Notebooks, a web-based platform through which students can work on projects and submit homework assignments and labs. Jupyter is an open-source, international project with one of its home bases at the Berkeley Institute for Data Science. One of the project's co-founders, Fernando Pérez, leads a local team that has allowed the Berkeley data science curriculum to leverage this transformative technology and use it at scale. No need to buy textbooks or lug them around: the course’s textbook is available online, free of charge and linked to notebooks that are hosted in the cloud. Students can access all their assignments, lecture notes and videos through the course website.

Fernando Pérez, co-founder of Jupyter (photo: UC Berkeley)

The Foundations course and its online computing environment were designed to make data science available to a broad array of students — many without previous programming or statistics experience. The course’s instructors say that kind of wide reach wouldn’t be possible without Jupyter notebooks, which enable browser-based computation and avoid the need for students to install software, transfer files, or update libraries.

“Without the software technology tools that we’re using, the Foundations course wouldn’t exist,” said Statistics Professor Ani Adhikari, one of the course’s instructors and co-developers. “There would have been too many obstacles for students to easily gain access to the tools and data that they need.” 

The Jupyter platform allows students to weave calculations, code, data visualizations, and explanations into one document that they create interactively. In the course, all assignments are distributed as Jupyter notebooks that contain outlines of data analyses. These outlines guide students through the process of discovering, validating, and communicating insights from data. The notebooks let students learn how to “think together” with the computer, a broadly important and powerful skill.

“This format is ideal for data science because analyzing data requires more than just running computations. We encourage all of our students to describe their assumptions, observations, and conclusions when working with data,” said Computer Science Professor John DeNero, who, with Adhikari, co-developed and teaches the course.
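To make the notebook format concrete, here is a purely illustrative sketch of the kind of cell such an assignment might contain: a few lines of code that load a data set, print a summary, and draw a plot, surrounded by prose that records assumptions and conclusions. The file name, column name, and the use of pandas and matplotlib are assumptions for this example, not material from the course.

    # Hypothetical notebook cell: load a small data set, summarize it, and plot it.
    # "census_sample.csv" and the "age" column are placeholder names.
    import pandas as pd
    import matplotlib.pyplot as plt

    ages = pd.read_csv("census_sample.csv")   # load the data into a table
    print(ages["age"].describe())             # quick numeric summary, shown inline

    ages["age"].hist(bins=20)                 # visualization appears in the same document
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.title("Distribution of ages in the sample")
    plt.show()

In a notebook, the summary and the chart appear directly beneath the cell, and the surrounding text cells hold the written explanation that instructors ask students to provide.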

The Jupyter platform has also enabled the development of a variety of other Berkeley data science courses, including dozens of connector courses and data science modules. Its ease of use has made it a powerful platform, even for instructors with little prior experience in programming. Berkeley’s large-scale use of Jupyter Notebooks is enabled by a cloud-hosted deployment of JupyterHub, infrastructure that ensures that all students and instructors share a common computing environment, data sources, and storage for their notebooks in the cloud.
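As a rough illustration of what such a deployment involves, JupyterHub is configured through a Python file, conventionally named jupyterhub_config.py, that tells the hub where to listen, how to start each user’s notebook server, and who the administrators are. The sketch below uses placeholder addresses, paths, and user names; it is not Berkeley’s actual configuration.

    # jupyterhub_config.py -- a minimal, illustrative sketch, not Berkeley's deployment.
    # All values (address, port, paths, user names) are placeholders.
    c = get_config()  # provided by JupyterHub when it loads this file

    # Where the hub listens for student connections.
    c.JupyterHub.ip = "0.0.0.0"
    c.JupyterHub.port = 8000

    # Start every user's server in the same directory layout, so assignments
    # and shared data sets appear in a common, predictable location.
    c.Spawner.notebook_dir = "~/notebooks"

    # Course staff allowed to manage other users' servers.
    c.Authenticator.admin_users = {"instructor", "head-ta"}

A real campus-scale deployment layers authentication, per-user resource limits, and cloud storage on top of a file like this, but the common environment students see is ultimately described by this kind of configuration.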

According to David Culler, professor of computer science and interim dean of the new Division of Data Sciences, a Jupyter notebook can be likened to a digital narrative or essay: students start with an idea and some data, refine the idea through a series of “computational paragraphs” made up of code, text, and visualization, and finally arrive at an interesting conclusion, all within a single document.

“The view of computing as an integral part of human creative processes has been around for fifty years, but it is a slender strand of thought that is often lost in all of the exponential advances of technology with its new apps, new machines, and new services,” Culler said. “The computational notebook brings this aspect back to the fore in a manner that is especially powerful as students’ first exposure to computing.”

Jupyter Notebooks: a powerful platform for data science education

Used by more than 1 million academics and professionals in fields ranging from finance to astrophysics, Jupyter notebooks have won significant praise and support, emerging as the gold standard for data science. The original iteration of the platform, known as IPython, was first developed in 2001 by Fernando Pérez, then a graduate student in particle physics. Today, Pérez is a Staff Scientist at Lawrence Berkeley National Laboratory and an Associate Researcher and Senior Fellow at the Berkeley Institute for Data Science (BIDS).

Pérez first developed the free, open-source tool in response to a challenge he faced in his own research as he grappled with a theoretical physics problem. Over the next decade, he and a team of other researchers developed IPython into Jupyter, an interactive computing environment that allows a researcher to run experiments and get results in real time, and to display data in a wide range of ways. Today, Pérez and his co-investigator Professor Brian Granger of Cal Poly continue to lead the Jupyter team with support from the Moore, Sloan and Helmsley Foundations as well as the Department of Energy and industry partners.

Screenshot of a Jupyter notebook

The software is considered a game-changer for research because it enables scientific researchers to share code alongside detailed descriptions of what it does, allowing others to validate and build on their research. The tool’s computational notebooks are described as the computing equivalent of a scientist’s lab notebook: an environment in which scientists worldwide can develop code and run it immediately.

Pérez notes that even in a world in which huge data centers in the cloud are processing mountains of data, it is still humans who must define the important questions, extract insights, and act on those conclusions. He says that Jupyter is designed to be the bridge between humans and technology, across programming languages and in environments ranging from a laptop to a supercomputer.

“The role of Jupyter is to give students, researchers, journalists or industry engineers tools that give them a coherent handle on the entire process of computational exploration and discovery,” said Pérez. “We have built it so the same tools are used for individual data analysis or to create a published article, course or book. Ultimately we imagine scholarly research, policy decisions or journalistic analyses to seamlessly connect to the sources of data and computation that informed their conclusion. This vision is one where open science, education, and culture support a more informed citizenry, and we hope Jupyter contributes to that future.”