Passing Notebooks in Class: West Big Data Innovation Hub’s UC Berkeley and UW partner with Microsoft on shared data science infrastructure

October 12, 2017

UC Berkeley Data Science Education Infrastructure Team: Ryan Lovett, Chris Hench, Yuvi Panda, Vinitra Swamy, Gunjan Baid, and Chris Holdgraf at JupyterCon 2017.

In just five semesters at UC Berkeley, Foundations of Data Science (Data 8) has grown to over 1,000 students, making it the fastest-growing course in the university’s history. Teaching data science accessibly and at scale depends on a robust infrastructure. The University of Washington, which co-leads the National Science Foundation seed-funded West Big Data Innovation Hub with UC Berkeley and UC San Diego, is the first campus to partner directly with Berkeley’s Data Science Education infrastructure team to deploy the program’s infrastructure and framework outside of Berkeley.

In addition to having a near-50/50 gender ratio, mentioned at the Grace Hopper Celebration of Women in Computing last week, Berkeley’s Data 8 course maintains all software components as open-source projects and employs Jupyter notebooks for all assignments. It’s a model that’s now spread to a range of other courses at Berkeley. “It was important to our faculty to integrate mechanisms for supporting open science and reproducibility as we developed and taught Data 8,” says Cathryn Carson, Faculty Lead for the Data Science Education Program. “We’re glad if we can smooth the path for other universities to do the same.”

Collaboration with University of Washington

In a similar vein, the new DATA 512: Human-Centered Data Science course at UW aims to enable reproducible workflows and foster community dialogue. “I expect that Jupyter notebooks and the course infrastructure we are piloting for the first time will be great teaching and communication tools for reinforcing best practices and supporting experimentation,” notes instructor Jonathan T. Morgan, Senior Design Researcher at the Wikimedia Foundation. The notebooks will help guide students in sharing the story of their research using various tools, from code and data to prose and visualizations, with the goal of making the projects more accessible and impactful for a wider variety of audiences.

Throughout the new Human-Centered Data Science course, students are using Jupyter notebooks for in-class projects and homework assignments that involve a combination of programming, analysis, documentation, and reflection. “Many of the students in this class will soon be out doing data science in the wild, and this exposure to Project Jupyter will probably be their first experience with a professional-class data science programming environment,” says Morgan. “The shared environment promotes transparency and encourages collaboration.”

Indeed, the methods and culture supported through these courses have broader impacts beyond the student experience at the university. Just as all Data 8 lectures, slides, and textbooks are notebooks available online, the DATA 512 materials are being added to an open wiki. The course instructor teams encourage others to use the materials and contact the program organizers. “With our new Division of Data Sciences, we are both responding to the opening of research frontiers and growing campus demand for cross-cutting data science course offerings, while spearheading a larger global movement in data science education,” says UC Berkeley’s Interim Dean of Data Sciences David Culler.

The same infrastructure is proving its worth for research collaborations. Jupyter team members Yuvi Panda and Chris Holdgraf (UC Berkeley), along with Carol Willing (Cal Poly), helped stand up a JupyterHub for Neurohackweek, an NIH-funded summer school in neuroimaging and data science, held at UW’s eScience Institute last month.

Accelerating Scaling of JupyterHub

All of this infrastructure is based on the Zero to JupyterHub project, which provides a flexible and scalable JupyterHub deployment that can run on many cloud providers. Maintained by the JupyterHub team, the project aims to make it possible to create a JupyterHub with minimal expertise in system administration or cloud computing, and to customize this JupyterHub to fit a variety of educational and research needs.

Leveraging a small subset of the $3M boost in cloud credits provided by Microsoft, “the collaboration is designed to demonstrate the feasibility of scaling what is working so well at Berkeley to a broader community,” notes West Big Data Innovation Hub Executive Director Meredith Lee.

“UC Berkeley has set a great example on how to democratize data science education through the ingenious use of Jupyter notebook technology on the cloud. The West Big Data Hub is perfectly situated to amplify and accelerate this movement to urgently fulfill the industry and societal demand for qualified data scientists across the Western US,” says Vani Mandava, Director of Data Science, Microsoft Research.

To learn more about the partnerships and activities of the hub, visit http://westbigdatahub.org and join the conversation on social media with #BDHubs.