The Beauty of DataHub

DataHub, UC Berkeley Division of Data Sciences

May 2, 2018

A new tool is rapidly changing how computing is taught, and UC Berkeley developers and professors are pioneering its integration into a growing number of campus classrooms. DataHub, UC Berkeley’s deployment of the JupyterHub platform, first entered a single Berkeley classroom in 2015 and is now used by more than 20 classes on campus. Thanks to this platform, more students have the power to integrate computing into their education at their fingertips.

The JupyterHub platform launches Jupyter notebook servers, which allow users to create and share code, visualizations, and information across multiple programming languages. What makes these notebooks truly revolutionary in the classroom is that they provide a uniform, easy-to-set-up coding environment. Students can dive straight into their work, whereas before they were greeted with clunky installation and troubleshooting problems across many different computing environments.

“DataHub sidesteps this problem,” said Yuvi Panda, the Operations Architect helping develop and scale Berkeley’s DataHub. He pointed to DataHub’s uses in many new campus courses that integrate data science, all of which use DataHub to distribute and grade homework, set up projects and data sets, and even give lectures and provide course help. “It gets (students) straight to the point of ‘I’m going to learn data science, and I’m going to learn it now,’” Panda said.

Jumpstarting Jupyter at Berkeley

The notebooks have become even more powerful learning tools through the features that Berkeley’s DataHub developers have incorporated into the JupyterHub framework. For example, DataHub makes it easy to create and distribute assignments through Interact Links. These Interact Links, which point to GitHub repositories, turn a complicated string of file and dataset downloads into one simple click. DataHub also incorporates campus autograding systems into the JupyterHub framework, giving students crucial feedback on the spot when they face a challenging homework or project question. Instead of copying and pasting bits and pieces of buggy code, students can reach out to their professors and course staff by sharing their notebooks directly. The instructors can then open and run the problematic code on their own computers to help guide students through their problems.

A major problem that Berkeley’s DataHub team has tackled so far is scaling JupyterHub to meet the demands of all the students and classrooms that want to use it. JupyterHub was first integrated into one campus classroom through a small local cluster of servers. As DataHub needed to scale to support increasing demand, the solution lay in the cloud, through a framework called Kubernetes.

“Everything you can do on your computer with a Jupyter notebook that is locally installed, you can do in the cloud,” Panda said. “But more importantly, you can do so much more.”

For example, Panda explained that many students were limited by the 8 gigabytes of RAM on their laptops when trying to work with a 60-gigabyte dataset.

“You can get 60 gigabytes very easily from a cloud provider and use it for just the time you need to and then get rid of it,” Panda explained. “This would be much more difficult to do on your laptop.”

The Kubernetes cloud computing framework allowed Berkeley to massively scale the number of users that DataHub could support. That number has jumped from the hundreds to the thousands and is expected to soon reach tens of thousands worldwide, following Berkeley’s Data 8 edX launch.

Without a doubt, the JupyterHub platform ensures that accessing expensive computing machines, receiving supporting materials, or finding help never becomes a limiting factor for interested students, a point strongly emphasized by campus professor and researcher Fernando Perez, who played a key role in the development of JupyterHub.

“There is never a barrier to access,” Perez said. And with Data 8 and Data 100 enrollments on campus reaching record numbers and still continuing to rise, Perez added, the Jupyter DataHub “was the only realistic way to tackle the logistics behind the course.”

Jupyter Beyond Berkeley

Perez built the IPython programming environment in 2001 as a side project while in graduate school. The IPython project gathered interest and was eventually adopted as part of Project Jupyter, a community of engineers working together to develop and spread tools for interactive data science and scientific computing across multiple programming languages. As more people adopt these tools, versions of the notebooks are finding their way into research labs, classrooms, and companies such as Microsoft, Google, and IBM.

“We’re at a point that it’s sort of obvious to everyone that using the cloud in this way to improve installation difficulties is a very useful thing to do,” Panda added. “Not just for pedagogy, but also for industry data analytics.”

At last year’s JupyterCon in New York, a tech conference bringing together data scientists, business analysts, researchers, developers, and other core members of the Project Jupyter community, several of Berkeley’s developments were spotlighted. Members of the team behind DataHub, including graduate students Vinitra Swamy and Gunjan Baid, shared the work that Berkeley had pioneered while integrating Project Jupyter into popular classes such as Data 8 and Data 100.

Several universities, from Europe to various parts of the US and Canada, have already used Berkeley’s work to launch their own JupyterHub servers.

“Over time, people are starting to come around and build on top of our stack because we made it open source from the beginning and we made this strong effort to make it useful to everyone from the beginning,” Panda said.

“The coolest part is that they’re working on managing anywhere between 50,000 to 100,000 active users at the same time, a scale that has never been done for a data science education context before,” Swamy said. “Displaying our work to this world of enthusiasts that have been dealing with similar problems in areas of finance and business, where they’re taking these large computing cloud clusters and trying to scale them up, and showing them what we’ve been doing at Berkeley, it has been an incredibly rewarding journey,” she added.