SUMMARY

JupyterHub is an open source service that creates on-demand cloud-based Jupyter notebook servers. The project has allowed Berkeley’s data science program to deploy scalable Jupyter infrastructure utilizing cloud computing resources. It enables users to interact remotely with a standardized and common computing environment through any web browser. Compared to local environments that run Jupyter, a cloud-based JupyterHub provides many conveniences, including pre-installed software, quicker access to course content, and computing flexibility that enables even users on Chromebooks or iPads to run Jupyter notebooks. 

Choosing the right Jupyter environment infrastructure is often a challenging task. This page goes over a few options and lays out the costs associated with each option. Two different types of JupyterHub implementations intended for smaller class sizes (The Littlest JupyterHub) and larger class sizes (JupyterHub for Kubernetes) are outlined below. Links to resources can be found at the bottom of the page for the most up-to-date information.

DEPLOYMENT PATHS OVERVIEW

Notebook Services - Hosted for You

Cloud providers such as Google and Microsoft offer free hosted notebook services that let anyone develop and run code in a browser quickly. These services typically come with pre-installed packages and support for multiple languages, but they do not let users customize or scale the underlying hardware. They are therefore best suited to one-off notebook demos, such as a quickly arranged webinar, or to classes with very limited notebook and code requirements. The recommended user group size is under 30 people. For those who prefer an open-source solution to a cloud provider's service, Binder fills that role.


      

JupyterHub for Kubernetes - Self-Hosted

Integrating JupyterHub with a Kubernetes engine produces a robust and scalable version of JupyterHub that can deploy containerized environments without being tied to any one cloud provider. Because of the customizability and scale it provides, this option is the most complicated to set up and maintain, and it takes the longest to deploy. This service is recommended for large groups, anywhere from 100 to over 1,000 users. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.

    

The Littlest JupyterHub - Self-Hosted

Tailored for smaller groups of users, this option allows instructors to create a simple JupyterHub distribution on a single virtual machine. As a result, the environments created here do not need containerization. The recommended user group size is around 50 people or fewer, due to the capacity of a single machine and the lack of containerization. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.

  

Visual Studio Online - Hosted for You

Visual Studio Online provides cloud-powered development environments and instances for any activity. This service supports working with Jupyter Notebooks natively and can be configured either with JupyterHub as a central hub for all users or individually for each user. A medium-sized user group of up to 100 people is recommended for this service, due to its moderate scalability and fast deployment time.

    

Local Jupyter Notebooks - Self-Hosted

Perhaps the simplest option, running Jupyter notebooks locally allows one to completely customize their set-up and run JupyterHubs on their own server. This is the original local development environment for Jupyter notebooks, and it relies on one’s own computer hardware for compute. There is no hard limit on how many users can use this option since it is locally hosted, but it may be more difficult to standardize environments across students. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.

Colab, Binder, Azure Notebooks
  • Costs: Free
  • Customization: Pre-installed packages + conda, pip, multiple language support (Python, R, F#), no hardware scalability
  • Deployment Process: Sign-up for service online - fast deployment
  • Scale: 1 GB data limit, 4 GB memory limit, single user

JupyterHub (Kubernetes)
  • Costs: Anywhere from $0.511-0.73 (2 vCPUs, 4 GB RAM) to $2.19-2.56 (8 vCPUs, 16 GB RAM) per user/month for 100 users
  • Customization: Existing Docker images, user resource guarantees and limits (CPUs, RAM, storage), Kubernetes cluster(s) size, environment features, hardware components
  • Deployment Process: Cloud provider, Kubernetes, Helm, JupyterHub (optionally Docker) - slower deployment
  • Scale: Large group of concurrent users (50-400+), computational resources dependent on cloud provider

The Littlest JupyterHub
  • Costs: Anywhere from $2.61-3.42 (4 vCPUs, 16 GB RAM) to $26.08-54.66 (64 vCPUs, 256 GB RAM) per user/month for 50 users
  • Customization: Pre-installed packages + conda, pip, plugins that provide additional features
  • Deployment Process: Cloud provider, server (at least Ubuntu 18.04) - relatively fast deployment
  • Scale: Small number of concurrent users (20-25), small classes, low maintenance

Visual Studio Online
  • Costs: Anywhere from $2.63 (4 vCPUs, 8 GB RAM, 64 GB HDD) to $3.42 (8 vCPUs, 16 GB RAM, 64 GB HDD) per user/month for 1 user
  • Customization: Has marketplace for extensions to add to individual user environments. Can run most languages and frameworks (Node.js, Python, .NET Core, etc.)
  • Deployment Process: Browser-based version and also downloadable application for Visual Studio - relatively fast deployment
  • Scale: Any number of users (20 environments per plan x 20 plans per subscription)

Jupyter (local)
  • Costs: Free
  • Customization: Can customize everything: gives full control and understanding of technical layout during setup - graphics, user interface, configuration, networking, security, etc.
  • Deployment Process: Locally hosted - fast deployment
  • Scale: N/A

THE LITTLEST JUPYTERHUB COST ESTIMATES

 

To estimate the cost of each option accurately, three dimensions of the set-up must be considered: the memory (RAM), CPU, and disk space that each user needs. Class size and course needs determine how much of each resource to allocate. The chart below outlines three class sizes: small classes (~30-50 students), average-sized classes (~80-100 students), and large classes (~100-400+ students). The following formulas for these three dimensions apply across most scenarios for The Littlest JupyterHub (TLJH):

  1. Recommended Memory = (Maximum Concurrent Users x Maximum Memory per User) + 128 MB
  2. Recommended vCPUs = (Maximum Concurrent Users x Maximum CPU Usage per User) + 20%
  3. Recommended Disk Size = (Total Users x Maximum Disk Usage per User) + 2 GB
  • The maximum number of concurrent users is typically about 40-60% of total users at any given point
  • 1 GB typically serves as the maximum memory per user, with 128 MB being overhead for TLJH and related services
  • Based on class sizes: 16 GB for 30 students, 64 GB for 100 students, and 256 GB for over 400 students
  • Memory, vCPU, and disk sizes usually come bundled together in machine types, so the memory calculation can be used to determine the rest of the setup ⇒ 4 vCPUs and 32 GB disk space for small classes, 16 vCPUs and 128 GB disk space for average-sized classes, and 64 vCPUs and 512 GB disk space for large classes
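
The three TLJH formulas above can be sketched as a small Python calculator. This is a minimal sketch under stated assumptions rather than an official sizing tool: the function and parameter names are hypothetical, "+ 20%" is read as a 20% overhead on the vCPU total, 1 GB is treated as 1024 MB, and rounding memory up to the next power of two is an assumption that happens to reproduce the 16, 64, and 256 GB figures above when concurrency is taken as 50%.

```python
import math

MB_PER_GB = 1024  # assumption: binary gigabytes


def next_power_of_two(value):
    """Round a positive number up to the nearest power of two."""
    return 2 ** math.ceil(math.log2(value))


def concurrent_users(total_users, concurrency=0.5):
    """Estimate peak concurrent users; the page suggests 40-60% of total users."""
    return math.ceil(total_users * concurrency)


def tljh_memory_gb(max_concurrent_users, memory_per_user_gb=1.0, overhead_mb=128):
    """Formula 1: (concurrent users x memory per user) + 128 MB overhead."""
    return max_concurrent_users * memory_per_user_gb + overhead_mb / MB_PER_GB


def tljh_vcpus(max_concurrent_users, cpu_per_user):
    """Formula 2, reading '+ 20%' as a 20% overhead on the total (assumption)."""
    return max_concurrent_users * cpu_per_user * 1.2


def tljh_disk_gb(total_users, disk_per_user_gb, base_gb=2):
    """Formula 3: (total users x disk per user) + 2 GB."""
    return total_users * disk_per_user_gb + base_gb


for students in (30, 100, 400):
    raw = tljh_memory_gb(concurrent_users(students))
    print(f"{students} students: {raw} GB raw, {next_power_of_two(raw)} GB provisioned")
```

The per-user CPU and disk values are left as parameters because this page does not state them; any value passed in is the caller's own estimate.
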
Small Class (~30-50 students) - Memory: 16 GB, CPU: 4 vCPUs, Disk: 32 GB
  • Google Cloud (n1-standard-4): $0.1900/hr = $138.70/month
  • Azure (D4 v3): $0.234/hr = $170.82/month
  • AWS (m5.xlarge): $0.192/hr = $140.16/month
  • DigitalOcean (4 vCPU model): $0.179/hr = $130.67/month

Average-Size Class (~80-100 students) - Memory: 64 GB, CPU: 16 vCPUs, Disk: 128 GB
  • Google Cloud (n1-standard-16): $0.7600/hr = $554.80/month
  • Azure (D16 v3): $0.936/hr = $683.28/month
  • AWS (m5.4xlarge): $0.768/hr = $560.64/month
  • DigitalOcean (16 vCPU model): $0.714/hr = $521.22/month

Large Class (~400+ students) - Memory: 256 GB, CPU: 64 vCPUs, Disk: 512 GB
  • Google Cloud (n1-standard-64): $3.0400/hr = $2,219.20/month
  • Azure (D64 v3): $3.744/hr = $2,733.12/month
  • AWS (m5.16xlarge): $3.072/hr = $2,242.56/month
  • DigitalOcean (40 vCPU model): $1.786/hr = $1,303.78/month
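
The monthly figures above are consistent with multiplying each hourly rate by 730 hours (8,760 hours per year divided by 12), and the per-user TLJH range quoted in the overview ($2.61-3.42 for 50 users) matches dividing the small-class monthly costs by 50. The following Python sketch shows that arithmetic; the function names are hypothetical, and the rates are copied from the table and may no longer reflect current provider pricing.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12; matches the table's monthly figures


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Convert an hourly instance rate into a monthly cost in dollars."""
    return round(hourly_rate * hours, 2)


def cost_per_user(hourly_rate, users):
    """Spread the monthly cost of one machine evenly across its users."""
    return round(monthly_cost(hourly_rate) / users, 2)


# Hourly rates for the small-class row, copied from the table above.
small_class_rates = {
    "Google Cloud n1-standard-4": 0.1900,
    "Azure D4 v3": 0.234,
    "AWS m5.xlarge": 0.192,
    "DigitalOcean 4 vCPU model": 0.179,
}

for name, rate in small_class_rates.items():
    print(f"{name}: ${monthly_cost(rate):.2f}/month, "
          f"${cost_per_user(rate, 50):.2f} per user for 50 users")
```
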

JUPYTERHUB FOR KUBERNETES COST ESTIMATES

To estimate the cost of each option accurately, three dimensions of the set-up must be considered: the memory (RAM), CPU, and disk space that each user needs. Class size and course needs determine how much of each resource to allocate. The chart below outlines three class sizes: small classes (~30-50 students), average-sized classes (~80-100 students), and large classes (~100+ students). Autoscaling is a key feature of Kubernetes and the primary cost saver here: a Kubernetes cluster scales down at night and during weekends, and scales up on demand.

The calculations here are based on Berkeley’s JupyterHub cost estimates, which can be viewed here: https://github.com/data-8/jupyterhub-k8s/blob/master/docs/cost-estimation/gce_budgeting.ipynb

The following formulas for each of these three dimensions are widely applicable across different scenarios for JupyterHub for Kubernetes:

  1. Number of Active Pods = Total Users / 4

  2. Recommended Memory = 1 GB x Number of Active Pods

  • The number of active pods used by classes, on average, falls somewhere between 1/3rd and 1/6th of total users ⇒ dividing by 4 produces a good estimate for this number

  • Memory allocated to each pod is 1 GB; multiply by the number of active pods to get the overall memory allocation

  • Based on class sizes: 8 GB for 30 students, 32 GB for 100 students, and 128 GB for over 400 students

  • Since memory, vCPU, and disk sizes usually come bundled together in machine types, the memory calculation can be used to determine the rest of the setup ⇒ 2 vCPUs and 16 GB disk space for small classes, 8 vCPUs and 64 GB disk space for average-sized classes, and 32 vCPUs and 256 GB disk space for large classes
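
The two Kubernetes formulas above can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names: the final rounding up to the next power of two is an assumption, chosen because it reproduces the 8, 32, and 128 GB figures listed above, and the 1/3 and 1/6 ratios are used only to show the range around the divide-by-4 estimate.

```python
import math


def active_pods(total_users, divisor=4):
    """Formula 1: estimated active pods = total users / 4."""
    return total_users / divisor


def k8s_memory_gb(total_users, memory_per_pod_gb=1, divisor=4):
    """Formula 2: 1 GB per active pod."""
    return active_pods(total_users, divisor) * memory_per_pod_gb


def next_power_of_two(value):
    """Round a positive number up to the nearest power of two (assumption)."""
    return 2 ** math.ceil(math.log2(value))


for students in (30, 100, 400):
    estimate = k8s_memory_gb(students)
    low = k8s_memory_gb(students, divisor=6)   # 1/6th of users active
    high = k8s_memory_gb(students, divisor=3)  # 1/3rd of users active
    print(f"{students} students: {estimate} GB estimated "
          f"(range {low:.1f}-{high:.1f} GB), "
          f"{next_power_of_two(estimate)} GB provisioned")
```
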

Small Class (~30-50 students) - Memory: 8 GB, CPU: 2 vCPUs, Disk: 16 GB
  • Google Cloud (n1-standard-2): $0.0950/hr = $69.35/month
  • Azure (D2 v3): $0.117/hr = $85.41/month
  • AWS (m5.large): $0.096/hr = $70.08/month
  • DigitalOcean (2 vCPU model): $0.089/hr = $64.97/month

Average-Size Class (~80-100 students) - Memory: 32 GB, CPU: 8 vCPUs, Disk: 64 GB
  • Google Cloud (n1-standard-8): $0.3800/hr = $277.40/month
  • Azure (D8 v3): $0.468/hr = $341.64/month
  • AWS (m5.2xlarge): $0.384/hr = $280.32/month
  • DigitalOcean (8 vCPU model): $0.357/hr = $260.61/month

Large Class (~400+ students) - Memory: 128 GB, CPU: 32 vCPUs, Disk: 256 GB
  • Google Cloud (n1-standard-32): $1.5200/hr = $1,109.60/month
  • Azure (D32 v3): $1.872/hr = $1,366.56/month
  • AWS (m5.8xlarge): $1.536/hr = $1,121.28/month
  • DigitalOcean (32 vCPU model): $1.429/hr = $1,043.17/month
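
The monthly figures above assume the machines run around the clock. Since autoscaling is described as the primary cost saver for Kubernetes, a rough way to see its effect is to blend a peak hourly rate and an idle hourly rate by the share of the week spent at peak. The Python sketch below is a deliberately coarse model, not how a cluster autoscaler actually bills: the function name is hypothetical, and the 60 peak hours per week and the choice of the small-class rate as the idle rate are illustrative assumptions, not measurements.

```python
HOURS_PER_WEEK = 168
HOURS_PER_MONTH = 730  # same monthly convention as the cost tables


def autoscaled_monthly_cost(peak_hourly_rate, idle_hourly_rate, peak_hours_per_week):
    """Blend peak and idle hourly rates by the share of the week spent at peak."""
    if not 0 <= peak_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("peak_hours_per_week must be between 0 and 168")
    peak_share = peak_hours_per_week / HOURS_PER_WEEK
    blended_rate = peak_share * peak_hourly_rate + (1 - peak_share) * idle_hourly_rate
    return blended_rate * HOURS_PER_MONTH


# Google Cloud rates from the table: n1-standard-32 at peak, n1-standard-2 when idle.
always_on = 1.52 * HOURS_PER_MONTH
autoscaled = autoscaled_monthly_cost(1.52, 0.095, peak_hours_per_week=60)
print(f"always on: ${always_on:.2f}/month, autoscaled: ${autoscaled:.2f}/month")
```
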

FREQUENTLY ASKED QUESTIONS (FAQ) 

  • Should I deploy on my hardware or in the cloud?

    • Deploying a JupyterHub in the cloud typically works out to be easier, especially if you don’t have much access to technical resources. Having to handle infrastructure reliably takes time and effort, so we recommend the cloud as a scalable and simpler solution.

  • Is there a particular cloud provider that you recommend?

    • See the deployment path overview section above. We recommend different cloud providers for different scenarios, although we would recommend using a cloud provider that you have more knowledge/experience with if there is one.

  • In general, which option would you recommend for different class sizes, The Littlest JupyterHub or JupyterHub on Kubernetes? 

    • For courses of 100 or fewer students, we recommend The Littlest JupyterHub, and for classes of more than 100 students, we recommend JupyterHub on Kubernetes.

  • What type of hardware should I use? 

    • Our recommendation varies with the size of the class as well as your own preferences, for example fewer machines with a lot of RAM, or machines with fast CPUs instead. Overall, we have seen that RAM tends to be the largest consideration in our calculations, so we would generally recommend the first option: fewer machines with a lot of RAM.

  • What are the estimated costs of launching these JupyterHubs?

    • See our cost estimation tables for both The Littlest JupyterHub and JupyterHub on Kubernetes above. 

  • What are some major technical considerations to keep in mind when using these solutions?

    • Using our own experiences with these technologies, we have come up with a few major challenges, solutions, and an ideal workflow for administrators.

    • Challenges:

      • Environment standardization is difficult. As a result, we don’t recommend having students set up their own environment, especially for an introductory class.

      • Don’t become dependent on any one cloud provider. Since there are many choices for cloud services, we believe it is not prudent to rely on any particular one of them.

      • Use platform-agnostic tools. This allows the course infrastructure to be useful for a variety of topics.

      • Use open-source tools. Otherwise, you may get stuck with the problem of having proprietary software that is not easily generalizable.

      • Team members with development and operations skills are not as common in academia. They will be necessary in order to help scale the technical solutions listed below.

    • Solutions:

      • Harness the cloud’s power. This allows course material to be available to all students regardless of whatever hardware they choose to use.

      • Abstract away complex APIs and technologies. There are a plethora of different packages and APIs, with each of them having complex underpinnings, and so we try to focus on only the fundamentals of the underlying analysis and set students up for more advanced courses later on. 

      • Use diverse and compelling real-world datasets. These will keep students interested in the course material as they know that what they’re doing is “real” data science.

      • Anticipate bursts of activity. Students generally do their work during very specific times, such as during class or right before the homework is due. Ensure that your cloud infrastructure is dynamic enough to support that.

      • Be able to meet maximum demand. If the cloud goes down during a test or right before an assignment is due, it could cause massive logistical problems. 

      • Do all of the above with a small team. The model we present would not be able to scale if it necessitated a large team. We generally have structured our courses such that tech-savvy undergraduates would be able to handle back-end operations.

    • Using nbgitpuller:

      • nbgitpuller is a Jupyter extension commonly deployed with JupyterHub, and it is a large advantage over many proprietary platforms. It lets instructors distribute content from a Git repository to students through a simple link, so students never need to interact with Git directly. It is primarily used with a JupyterHub, but can also work on students' local computers.

      • Workflow:

        • Instructor creates some course material to give to students.

        • Instructor pushes latest version to GitHub and sends students a link to interact with material.

        • Student clicks on link.

        • DataHub authenticates user by either having them sign in or checking their computer’s credentials.

        • DataHub creates and starts a Jupyter instance for user or pulls up a pre-existing environment from a previous session.

        • Student’s persistent storage volume links to their Jupyter instance.

        • DataHub clones or pulls content specified by link into student’s instance.

        • Student is directed to a live, in-browser notebook instance that contains the content specified in the link and can be used immediately.

  • How can I customize the JupyterHub environment for my class?

    • See this page on Zero to Data 8, which is a resource dedicated to helping administrators understand how to get UC Berkeley’s Data 8 course set up and doubles as a resource for helping get JupyterHubs set up for courses in general.
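
The nbgitpuller workflow in the FAQ above starts with a link that the instructor sends to students. A minimal Python sketch of building such a link is shown below. It follows the `/hub/user-redirect/git-pull` link format with `repo`, `branch`, and `urlpath` query parameters as we understand nbgitpuller to use it, so it should be checked against the nbgitpuller documentation for the installed version. The function name, hub address, repository, and notebook path are all hypothetical placeholders.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Build a link that asks the hub to pull repo_url into the student's server."""
    params = {"repo": repo_url, "branch": branch}
    if urlpath:
        # Where the student lands after the pull, e.g. a specific notebook.
        params["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


# Hypothetical hub and repository, for illustration only.
link = nbgitpuller_link(
    "https://hub.example.edu",
    "https://github.com/example-org/course-materials",
    branch="main",
    urlpath="tree/course-materials/lab01.ipynb",
)
print(link)
```

Percent-encoding the repository URL and path with `urlencode` keeps the link valid even when those values contain characters such as `/` or `:`.
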

RESOURCES

Cloud Providers

Pathway Guides