Interested in setting up Datahub for your class?

Datahub creates on-demand, cloud-based Jupyter notebook and RStudio servers and is the basis of the technical infrastructure for Data 8 and related courses.

The main Datahub deployment is at datahub.berkeley.edu. In addition, several other hub archetypes serve the diverse instructional needs of Berkeley instructors.

Jupyter Notebook Examples

City Planning

Notebook Example from City Planning 88

Introductory notebook in Python exploring the concept of node-based networks in the connector course City Planning 88, taught by Marta Gonzalez.

Political Science

Notebook Example from Political Science 3

Introductory notebook in R exploring the question of whether politicians racially discriminate against their constituents, from the Introduction to Empirical Analysis and Quantitative Methods course taught by David Broockman.

Biology

Notebook Example from Biology 1B

Notebook in Python analyzing data collected from the North and South Forks of Strawberry Creek in the Evolution, Ecology, and Organizational Diversity course.

Ethnic Studies

Notebook Example from Ethnic Studies 21AC

Notebook in Python analyzing incarceration trends and the impacts of prison realignment in California, part of the Ethnic Studies modules taught by Victoria E. Robinson.

Hub Archetypes

Datahub

Datahub serves many foundational courses across diverse disciplines and is the hub for instructors who want to run Jupyter-based workflows. It provides standard computing infrastructure, Python package management, and storage that meet the instructional requirements of many introductory data science courses.

R Hub

R Hub provides standard computing infrastructure to instructors using R-based tools (RStudio IDE, Jupyter R). It is widely used by instructors teaching quantitative social science courses. Fun fact: the infrastructure team at Berkeley made a major contribution to the JupyterHub ecosystem by adding RStudio to the standard offering, which improved access for instructors who teach with R.

Biology Hub

Biology Hub is compute-intensive infrastructure tailored to the needs of instructors in biology and genomics. The hub provides additional compute for data science use cases that require large datasets; for example, it supports compute-intensive workflows for analyzing large genome-sequencing datasets.

Stat 159 Hub

Stat 159 Hub is an innovation hub tailored to the needs of the Stat 159 course taught by Fernando Perez. One objective is to make the hub a "home away from home" for students enrolled in the course. Students use the hub like their local setup and draw on some of the more advanced Datahub use cases, including a remote Linux desktop environment, secure access to GitHub, Dropbox-like file sharing, real-time collaboration, and real-time file sharing.

Datahub Principles

Inclusion

Datahub is built with the principle of inclusion in mind. Any instructor, irrespective of their domain, can expose their students to data science workflows using Datahub.

Accessibility

Datahub completely removes the dependency on a student's local desktop configuration for running data science workflows. It provides the required infrastructure, including storage and compute, in an equitable manner for all students.

Open Source

Datahub is built with an open-source ethos in mind. It is completely free of cost, and instructors and students need no license to access the infrastructure. In addition, the team behind Datahub has a strong connection with the open-source ecosystem, including the Jupyter ecosystem.

Scalability

Datahub was initially piloted in Spring 2017 in a small Data 8 classroom of 50+ students. As of the start of the Spring 2022 semester, Datahub supports more than 1,500 students enrolled in Data 8. The infrastructure's ability to handle this growth is a strong testament to its scalability.

Datahub Metrics

4,000+ daily active users (DAU)

10,000+ monthly active users (MAU)

150 student-years spent using Datahub since 2019

$3 cloud cost per student per semester
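
As a rough illustration of what the per-student figure implies, the sketch below multiplies an enrollment count by the $3 cloud cost per student per semester quoted above. The function name is hypothetical, and treating the $3 figure as a flat rate for every course is an assumption; actual costs will vary with workload.

```python
def estimate_semester_cloud_cost(num_students: int,
                                 cost_per_student: float = 3.0) -> float:
    """Return a rough cloud cost in USD for one semester.

    Assumes a flat per-student rate (default: the $3 figure quoted
    on this page); real costs depend on how heavily each course
    uses compute and storage.
    """
    if num_students < 0:
        raise ValueError("num_students must be non-negative")
    return num_students * cost_per_student


# Example: Data 8 at roughly 1,500 students (Spring 2022 figure above).
data8_cost = estimate_semester_cloud_cost(1500)
print(f"Estimated Data 8 cloud cost per semester: ${data8_cost:,.0f}")
# prints "Estimated Data 8 cloud cost per semester: $4,500"
```

An instructor planning a new course could substitute the expected enrollment to get a first-order estimate.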

Datahub Testimonials

Douglas Dreger

Professor of Geophysics

The Jupyter environment on Datahub has greatly improved the overall outcome of the assignments, where the more experienced students continue to do well, but the less computer-savvy students are also doing well in the assignments. Valuable time is therefore spent on the seismological lessons rather than figuring out how to do something and whether Excel or Matlab should be used. The process of defining assignments on bCourses and then linking to the assignment on Datahub works very well. In addition, when demonstrating or teaching how to approach a problem, it is very useful for the instructor to do so on exactly the same platform the students will be using for the assignment.

David Broockman

Assistant Professor of Political Science

My advice to other faculty would be that it does take a bit of learning the tools up front, but that the team does a great job of teaching you how to use the tools and supporting you. Then, once you get it, it's really seamless and an amazing stack of tools. I can't imagine teaching my class any other way.

Marta Gonzalez

Assistant Professor of Civil and Environmental Engineering

First, I had classes in Matlab but later shifted to Jupyter notebooks. Notebooks are very easy to teach from a pedagogical standpoint. The power of Datahub is that students can connect remotely to use Jupyter notebooks. Key advantages are that the packages are pre-installed and students need not have the latest environments or powerful computers. It makes such environments accessible. Through Datahub, we can do magic things with data.