Choosing the Right JupyterHub Infrastructure

SUMMARY

JupyterHub is an open source service that creates on-demand cloud-based Jupyter notebook servers. The project has allowed Berkeley’s data science program to deploy scalable Jupyter infrastructure utilizing cloud computing resources. It enables users to interact remotely with a standardized and common computing environment through any web browser. Compared to local environments that run Jupyter, a cloud-based JupyterHub provides many conveniences, including pre-installed software, quicker access to course content, and computing flexibility that enables even users on Chromebooks or iPads to run Jupyter notebooks.

Choosing the right Jupyter environment infrastructure is often a challenging task. This page goes over a few options and lays out the costs associated with each option. Two different types of JupyterHub implementations intended for smaller class sizes (The Littlest JupyterHub) and larger class sizes (JupyterHub for Kubernetes) are outlined below. Links to resources can be found at the bottom of the page for the most up-to-date information.

DEPLOYMENT PATHS OVERVIEW

Notebook Services - Hosted for You

Cloud providers such as Google and Microsoft offer free, externally hosted notebook services that let anyone develop and run code in a browser quickly. These services typically come with pre-installed packages and support for multiple languages, but they do not let users customize their hardware set-up or scale resources. They are therefore best suited to one-off notebook demos, such as a quickly arranged webinar, or to classes with very limited needs in terms of notebook and code functionality. The recommended user group size is under 30 people. For those who prefer an open-source solution over a cloud provider's service, Binder fills that role.

JupyterHub for Kubernetes - Self-Hosted

Integrating JupyterHub with a Kubernetes engine produces a robust and scalable version of JupyterHub that can deploy containerized environments without being tied to any one cloud provider. Because of the customizability and scale it provides, this option is the most complicated to set up and maintain, and it takes the longest to deploy. It is therefore recommended for large groups, anywhere from 100 to over 1,000 users. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.

The Littlest JupyterHub - Self-Hosted

Tailored for smaller groups, this option allows instructors to set up a simple JupyterHub distribution on a single virtual machine. As a result, the environments created here do not need to be containerized. The recommended user group size is around 50 people or fewer, given the capacity of a single machine and the lack of containerization. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.

Visual Studio Online - Hosted for You

Visual Studio Online provides cloud-powered development environments and instances for any activity. This service supports working with Jupyter Notebooks natively and can be configured to either use JupyterHub as a central hub for all users or individually for each user. A medium-sized user group is recommended for this service, up to 100 people, due to its moderate scalability and fast deployment time.

Local Jupyter Notebooks - Self-Hosted

Perhaps the simplest option, running Jupyter notebooks locally allows one to completely customize their set-up and run JupyterHubs on their own server. This is the original local development environment for Jupyter notebooks, and it relies on one’s own computer hardware for compute. There is no hard limit on how many users can use this option since it is locally hosted, but it may be more difficult to standardize environments across students. Being self-hosted and open-source, this version has great community support and resources - see the listed resources for more information.


	Colab, Binder, Azure Notebooks	JupyterHub (Kubernetes)	The Littlest JupyterHub	Visual Studio Online	Jupyter
Costs	Free	Anywhere from $0.511-0.73 (2 vCPUs, 4 GB RAM) to $2.19-2.56 (8 vCPUs, 16 GB RAM) per user/month for 100 users	Anywhere from $2.61-3.42 (4 vCPUs, 16 GB RAM) to $26.08-54.66 (64 vCPUs, 256 GB RAM) per user/month for 50 users	Anywhere from $2.63 (4 vCPUs, 8 GB RAM, 64 GB HDD) to $3.42 (8 vCPUs, 16 GB RAM, 64 GB HDD) per user/month for 1 user	Free
Customization	Pre-installed packages + conda, pip, multiple language support (Python, R, F#), no hardware scalability	Existing Docker images, user resource guarantees and limits (CPUs, RAM, storage), Kubernetes cluster(s) size, environment features, hardware components	Pre-installed packages + conda, pip, plugins that provide additional features	Has a marketplace for extensions to add to individual user environments; can run most languages and frameworks (Node.js, Python, .NET Core, etc.)	Can customize everything: gives full control and understanding of technical layout during setup - graphics, user interface, configuration, networking, security, etc.
Deployment Process	Sign-up for service online - fast deployment	Cloud provider, Kubernetes, Helm, JupyterHub (optionally Docker) - slower deployment	Cloud provider, server (at least Ubuntu 18.04) - relatively fast deployment	Browser-based version and also downloadable application for Visual Studio - relatively fast deployment	Locally hosted - fast deployment
Scale	1 GB data limit, 4 GB memory limit, single user	Large group of concurrent users (50-400+), computational resources dependent on cloud provider	Small number of concurrent users (20-25), small classes, low maintenance	Any number of users (20 environments per plan x 20 plans per subscription)	N/A

THE LITTLEST JUPYTERHUB COST ESTIMATES

To describe the costs of each option accurately, three dimensions of the set-up must be considered: the memory/RAM, CPU usage, and disk space that each user would need. Class size and class needs determine how much of each resource to allocate. Three major options are outlined in the following chart to show the allocation of resources across those three dimensions: small classes (~30-50 students), average-sized classes (~80-100 students), and large classes (~100-400+). The following formulas for each of these three dimensions are widely applicable across different scenarios for The Littlest JupyterHub (TLJH):

Recommended Memory = (Maximum Concurrent Users x Maximum Memory per User) + 128 MB
Recommended vCPUs = (Maximum Concurrent Users x Maximum CPU Usage per User) + 20%
Recommended Disk Size = (Total Users x Maximum Disk Usage per User) + 2 GB

The maximum number of concurrent users is typically about 40-60% of total users at any given point
1 GB typically serves as the maximum memory per user, with 128 MB being overhead for TLJH and related services
Based on class sizes: 16 GB for 30 students, 64 GB for 100 students, and 256 GB for over 400 students
Memory, vCPU, and disk sizes generally come bundled together in machine types, so the memory calculation can be used to determine the rest of the setup ⇒ 4 vCPUs and 32 GB disk space for small classes, 16 vCPUs and 128 GB disk space for average-sized classes, and 64 vCPUs and 512 GB disk space for large classes
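The three formulas above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the "+ 20%" in the vCPU formula is read here as 20% headroom on top of the product, and the per-user CPU (0.2) and disk (1 GB) values in the example are assumptions rather than figures from this page.

```python
import math

OVERHEAD_GB = 128 / 1024  # 128 MB of TLJH overhead, expressed in GB


def max_concurrent_users(total_users, fraction=0.5):
    """Estimate peak concurrent users as a fraction (40-60%) of total users."""
    return math.ceil(total_users * fraction)


def recommended_memory_gb(concurrent, memory_per_user_gb=1.0):
    """(Maximum Concurrent Users x Maximum Memory per User) + 128 MB."""
    return concurrent * memory_per_user_gb + OVERHEAD_GB


def recommended_vcpus(concurrent, cpu_per_user):
    """(Maximum Concurrent Users x Maximum CPU Usage per User) + 20%."""
    # Assumption: "+ 20%" means 20% headroom on top of the product.
    return concurrent * cpu_per_user * 1.2


def recommended_disk_gb(total_users, disk_per_user_gb):
    """(Total Users x Maximum Disk Usage per User) + 2 GB."""
    return total_users * disk_per_user_gb + 2


# Example: a 30-student class with hypothetical per-user CPU and disk limits.
concurrent = max_concurrent_users(30)        # 15 concurrent users
memory = recommended_memory_gb(concurrent)   # 15.125 GB, rounded up to a 16 GB machine
vcpus = recommended_vcpus(concurrent, 0.2)   # assumed 0.2 vCPU per user
disk = recommended_disk_gb(30, 1.0)          # assumed 1 GB of disk per user
print(concurrent, memory, vcpus, disk)
```

With these inputs the memory estimate lands just under the 16 GB suggested above for a 30-student class, and the disk estimate matches the 32 GB figure.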


	Google Cloud	Azure	AWS	DigitalOcean
Small Class (~30-50 students) Memory: 16 GB CPU: 4 vCPUs Disk: 32 GB	n1-standard-4: $0.1900/hr = $138.70/month	D4 v3: $0.234/hr = $170.82/month	m5.xlarge: $0.192/hr = $140.16/month	4 vCPU model: $0.179/hr = $130.67/month
Average-Size Class (~80-100 students) Memory: 64 GB CPU: 16 vCPUs Disk: 128 GB	n1-standard-16: $0.7600/hr = $554.80/month	D16 v3: $0.936/hr = $683.28/month	m5.4xlarge: $0.768/hr = $560.64/month	16 vCPU model: $0.714/hr = $521.22/month
Large Class (~400+ students) Memory: 256 GB CPU: 64 vCPUs Disk: 512 GB	n1-standard-64: $3.0400/hr = $2,219.20/month	D64 v3: $3.744/hr = $2,733.12/month	m5.16xlarge: $3.072/hr = $2,242.56/month	40 vCPU model: $1.786/hr = $1,303.78/month
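The monthly figures in the table appear to be the hourly rate multiplied by 730 hours (for example, $0.1900/hr x 730 = $138.70/month). A small sketch under that inferred assumption, with hypothetical function names, shows how to reproduce the table and derive a per-user cost:

```python
HOURS_PER_MONTH = 730  # inferred from the table, not stated on this page


def monthly_cost(hourly_rate):
    """Convert an hourly machine price into an always-on monthly price."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


def cost_per_user(hourly_rate, users):
    """Spread the monthly machine price evenly across a number of users."""
    return round(hourly_rate * HOURS_PER_MONTH / users, 2)


# Small-class machines shared by 50 users.
print(monthly_cost(0.19))          # Google Cloud n1-standard-4
print(cost_per_user(0.179, 50))    # DigitalOcean 4 vCPU model
print(cost_per_user(0.234, 50))    # Azure D4 v3
```

The two per-user results correspond to the low and high ends of the small-class TLJH range quoted in the overview table.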

JUPYTERHUB FOR KUBERNETES COST ESTIMATES

In order to describe the costs of each option accurately, there are three dimensions of the set-up that must be considered: the allotment of memory/RAM, CPU usage, and disk space that each user would need. In that way, varying class sizes and needs differentiate the amount of resources given to each cost dimension. Three major options are outlined in the following chart to determine the allocation of resources across those three types: small classes (~30-50 students), average-sized classes (~80-100 students), and large classes (~100+). Autoscaling is a key feature of Kubernetes that is the primary cost saver here - a Kubernetes cluster scales down at night and during weekends, scaling up on demand.

The calculations here are based on Berkeley’s JupyterHub cost estimates, which can be viewed here: https://github.com/data-8/jupyterhub-k8s/blob/master/docs/cost-estimation/gce_budgeting.ipynb

The following formulas for each of these three dimensions are widely applicable across different scenarios for JupyterHub for Kubernetes:

Number of Active Pods = Total Users / 4
Recommended Memory = 1 GB x Number of Active Pods

The number of active pods used by a class, on average, falls somewhere between 1/3 and 1/6 of total users ⇒ dividing by 4 produces a good estimate for this number
Each pod is allocated 1 GB of memory; multiply by the number of active pods to get the overall memory allocation
Based on class sizes: 8 GB for 30 students, 32 GB for 100 students, and 128 GB for over 400 students
Since memory, vCPU, and disk sizes generally come bundled together in machine types, the memory calculation can be used to determine the rest of the setup ⇒ 2 vCPUs and 16 GB disk space for small classes, 8 vCPUs and 64 GB disk space for average-sized classes, and 32 vCPUs and 256 GB disk space for large classes
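The Kubernetes estimate above can be sketched the same way. The function names are hypothetical, and rounding up to the next power of two is an assumption made here to reproduce the 8 GB, 32 GB, and 128 GB figures; the page itself does not state a rounding rule.

```python
import math


def active_pods(total_users, divisor=4):
    """Number of Active Pods = Total Users / 4, rounded up to a whole pod."""
    return math.ceil(total_users / divisor)


def k8s_memory_gb(total_users, memory_per_pod_gb=1):
    """Recommended Memory = 1 GB x Number of Active Pods."""
    return active_pods(total_users) * memory_per_pod_gb


def next_power_of_two(n):
    """Round a positive number up to the nearest power of two (assumed sizing step)."""
    return 1 << (math.ceil(n) - 1).bit_length()


for students in (30, 100, 400):
    raw = k8s_memory_gb(students)
    print(students, raw, next_power_of_two(raw))
```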


	Google Cloud	Azure	AWS	DigitalOcean
Small Class (~30-50 students) Memory: 8 GB CPU: 2 vCPUs Disk: 16 GB	n1-standard-2: $0.0950/hr = $69.35/month	D2 v3: $0.117/hr = $85.41/month	m5.large: $0.096/hr = $70.08/month	2 vCPU model: $0.089/hr = $64.97/month
Average-Size Class (~80-100 students) Memory: 32 GB CPU: 8 vCPUs Disk: 64 GB	n1-standard-8: $0.3800/hr = $277.40/month	D8 v3: $0.468/hr = $341.64/month	m5.2xlarge: $0.384/hr = $280.32/month	8 vCPU model: $0.357/hr = $260.61/month
Large Class (~400+ students) Memory: 128 GB CPU: 32 vCPUs Disk: 256 GB	n1-standard-32: $1.5200/hr = $1,109.60/month	D32 v3: $1.872/hr = $1,366.56/month	m5.8xlarge: $1.536/hr = $1,121.28/month	32 vCPU model: $1.429/hr = $1,043.17/month

FREQUENTLY ASKED QUESTIONS (FAQ)

Should I deploy on my hardware or in the cloud?

Deploying a JupyterHub in the cloud typically works out to be easier, especially if you don’t have much access to technical resources. Having to handle infrastructure reliably takes time and effort, so we recommend the cloud as a scalable and simpler solution.

Is there a particular cloud provider that you recommend?

See the deployment path overview section above. We recommend different cloud providers for different scenarios, although we would recommend using a cloud provider that you have more knowledge/experience with if there is one.

In general, which option would you recommend for different class sizes, The Littlest JupyterHub or JupyterHub on Kubernetes?

For courses of 100 or fewer students, we recommend The Littlest JupyterHub, and for classes with more than 100 students, we recommend JupyterHub on Kubernetes.

What type of hardware should I use?

Our recommendation varies based on the size of the class as well as your own preferences - fewer machines with a lot of RAM, machines with fast CPUs instead, etc. Overall, we have seen that RAM tends to be the largest consideration in our calculations, so we would generally recommend going with the first option.

What are the estimated costs of launching these JupyterHubs?

See our cost estimation tables for both The Littlest JupyterHub and JupyterHub on Kubernetes above.

What are some major technical considerations to keep in mind when using these solutions?

Using our own experiences with these technologies, we have come up with a few major challenges, solutions, and an ideal workflow for administrators.
Challenges:

Environment standardization is difficult. As a result, we don’t recommend having students set up their own environment, especially for an introductory class.
Don’t become dependent on any one cloud provider. Since there are a lot of choices for cloud services, we believe that it is not prudent to be reliant on any particular one of them.
Use platform-agnostic tools. This allows the course infrastructure to be useful for a variety of topics.
Use open-source tools. Otherwise, you may get stuck with the problem of having proprietary software that is not easily generalizable.
Team members with development and operations skills are not as common in academia. They will be necessary in order to help scale the technical solutions listed below.

Solutions:

Harness the cloud’s power. This allows course material to be available to all students regardless of whatever hardware they choose to use.
Abstract away complex APIs and technologies. There are a plethora of different packages and APIs, with each of them having complex underpinnings, and so we try to focus on only the fundamentals of the underlying analysis and set students up for more advanced courses later on.
Use diverse and compelling real-world datasets. These will keep students interested in the course material as they know that what they’re doing is “real” data science.
Anticipate bursts of activity. Students generally do their work during very specific times, such as during class or right before the homework is due. Ensure that your cloud infrastructure is dynamic enough to support that.
Be able to meet maximum demand. If the cloud goes down during a test or right before an assignment is due, it could cause massive logistical problems.
Do all of the above with a small team. The model we present would not be able to scale if it necessitated a large team. We generally have structured our courses such that tech-savvy undergraduates would be able to handle back-end operations.

Using nbgitpuller:

nbgitpuller is a Jupyter extension commonly used with JupyterHub, and it is a large advantage over many proprietary platforms. It lets instructors distribute content in a Git repository to students by having them click a simple link, so students never need to interact with Git directly. It is primarily used with a JupyterHub, but can also work on students' local computers.
Workflow:

Instructor creates some course material to give to students.
Instructor pushes latest version to GitHub and sends students a link to interact with material.
Student clicks on link.
DataHub (Berkeley’s JupyterHub deployment) authenticates the user by either having them sign in or checking their computer’s credentials.
DataHub creates and starts a Jupyter instance for user or pulls up a pre-existing environment from a previous session.
Student’s persistent storage volume links to their Jupyter instance.
DataHub clones or pulls content specified by link into student’s instance.
Student is directed to a live, in-browser notebook instance that contains content specified in link and is able to be immediately interacted with.
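The link that students click in this workflow can be built programmatically. The sketch below assumes the query-string format described in the nbgitpuller documentation (a `git-pull` endpoint under `/hub/user-redirect/` with `repo`, `branch`, and `urlpath` parameters); the hub address, repository, and function name are hypothetical placeholders, and the generated link should be checked against the nbgitpuller link generator before it is sent to students.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Build a link that pulls a Git repository into a student's Jupyter server."""
    params = {"repo": repo_url, "branch": branch}
    if urlpath:
        # Where to send the student after the pull, e.g. a specific notebook.
        params["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


# Hypothetical hub and course repository.
link = nbgitpuller_link(
    "https://hub.example.edu/",
    "https://github.com/example/course",
    urlpath="tree/course/lab01.ipynb",
)
print(link)
```

Encoding the parameters with `urlencode` keeps the repository URL and notebook path intact when the link is pasted into an email or course page.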

How can I customize the JupyterHub environment for my class?

See this page on Zero to Data 8, a resource dedicated to helping administrators understand how to set up UC Berkeley’s Data 8 course that doubles as a guide to setting up JupyterHubs for courses in general.

Where can I find additional help?

The Jupyter Community Forum is a one-stop shop for any Jupyter- or JupyterHub-related questions and is the place to ask for help.

RESOURCES

Cloud Providers

Google Cloud:

VM instance pricing: https://cloud.google.com/compute/vm-instance-pricing
Free tier: https://cloud.google.com/free

Microsoft Azure:

VM instance pricing: https://azure.microsoft.com/en-us/pricing/calculator/
Free tier: https://azure.microsoft.com/en-us/free/

Amazon Web Services (AWS):

VM instance pricing: https://aws.amazon.com/ec2/pricing/on-demand/
Free tier: https://aws.amazon.com/free/

DigitalOcean:

VM instance pricing: https://www.digitalocean.com/pricing/
60-day free trial: https://www.digitalocean.com/community/questions/is-there-a-digitalocean-free-trial-available

Pathway Guides

Azure Notebooks:

Website: https://notebooks.azure.com/
Signing Up: https://notebooks.azure.com/help/signing-up
Documentation: https://notebooks.azure.com/help/jupyter-notebooks
Creating a Project: https://notebooks.azure.com/help/projects

Zero to JupyterHub:

Website: https://zero-to-jupyterhub.readthedocs.io/en/latest/index.html
Creating Kubernetes Clusters: https://zero-to-jupyterhub.readthedocs.io/create-k8s-cluster.html
Setting up JupyterHub: https://zero-to-jupyterhub.readthedocs.io/en/latest/setup-jupyterhub
Customized Deployments: https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/
Administrator Guide: https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator

The Littlest JupyterHub (TLJH):

Website: http://tljh.jupyter.org/en/latest/
Use Cases: http://tljh.jupyter.org/en/latest/topic/whentouse.html#topic-whentouse
Installation: http://tljh.jupyter.org/en/latest/install/index.html
How-To Guides: http://tljh.jupyter.org/en/latest/howto/index.html
Topic Guides: http://tljh.jupyter.org/en/latest/topic/index.html