Overview
The customer has built a multi-tenant data lake on S3 and has started ingesting different types of data. Now they want to build a data science environment for data exploration using JupyterHub. Below are the requirements:
- Environment must be low cost.
- Environment must scale with number of data scientists.
- Environment should support authentication and authorization (S3 data lake).
- Notebooks must be stored in a centralized location and should be sharable.
- Environment must support installing custom packages and libraries, such as R, pandas, etc.
The customer prefers to use JupyterHub and does not want to use EMR due to the additional cost.
Architecture


1a. Create IAM policies and roles with access to specific S3 folders. For simplicity, let's assume the S3 bucket has two keys/folders called PII and Non-PII. Create a policy and role for each of PII and Non-PII.
2a. Create two Dockerfiles for authorization purposes. Each Dockerfile will have separate users for authentication, and later, while creating the ECS task definitions, each image will be attached to a different role for authorization. Store the Dockerfiles in CodeCommit.
2b. CodeBuild will trigger on commit.
2c. CodeBuild will build the images using the Dockerfiles.
2d. CodeBuild will push the images to Elastic Container Registry (ECR).
Web Based Environment
- A single task can be shared by multiple users.
- Tasks can scale based on a scaling policy.
- A minimum of one task must be running, so the customer has to pay for at least one task per task group.
- CPU and memory limits per task are 4 vCPU and 30 GB of memory.
3a. Create ECS cluster
3b. Create one task definition using the role that has access to the PII folder in S3 and the image that contains the users who need access to PII data, and another task definition for Non-PII.
3c. Create Services for PII and Non-PII.
3d. Create an Application Load Balancer with routing rules to the different services.
3f. Create an A-record in Route53 using one of the existing domains.
On Demand Environment
- EC2 instances can be provisioned using Service Catalog.
- One EC2 instance per user; users could also share an EC2 instance.
- The customer only pays for what they use.
- A wide variety of EC2 options is available, with much higher CPU and memory compared to ECS.
- Recommended for ad-hoc and very large data processing use cases.
In this blog, I will cover the implementation of the Web Based Environment; the On Demand Environment will be covered in part 2.
Dockerfile Walkthrough
Get the base image, update Ubuntu, and install jupyter, s3contents, and awscli. s3contents is required to store notebooks on S3.
#Base image
FROM jupyterhub/jupyterhub:latest
#USER root
# update Ubuntu
RUN apt-get update
# Install jupyter, awscli and s3contents (for storing notebooks on S3)
RUN pip install jupyter && \
pip install s3contents && \
pip install awscli --upgrade --user && \
mkdir /etc/jupyter
Install R and required packages.
# R pre-requisites
RUN apt-get update && \
apt-get install -y --no-install-recommends \
fonts-dejavu \
unixodbc \
unixodbc-dev \
r-cran-rodbc \
gfortran \
gcc && \
rm -rf /var/lib/apt/lists/*
# Fix for devtools https://github.com/conda-forge/r-devtools-feedstock/issues/4
RUN ln -s /bin/tar /bin/gtar
# R packages
RUN conda install -c r r-IRkernel && \
conda install -c r rstudio && \
conda install -c r/label/borked rstudio && \
conda install -c r r-devtools && \
conda install -c r r-ggplot2 r-dplyr && \
conda install -c plotly plotly && \
conda install -c plotly/label/test plotly && \
conda update curl && \
conda install -c bioconda bcftools && \
conda install -c bioconda/label/cf201901 bcftools
RUN R -e "devtools::install_github('IRkernel/IRkernel')" && \
R -e "IRkernel::installspec()"
Install S3ContentsManager to store notebooks in a centralized S3 location. Although the GitHub documentation says it should work with an IAM role, I ran into some errors, so for now I'm using an access_key_id and secret_access_key that have read/write access to the S3 bucket.
#S3ContentManager Config
RUN echo 'from s3contents import S3ContentsManager' >> /etc/jupyter/jupyter_notebook_config.py && \
echo 'c = get_config()' >> /etc/jupyter/jupyter_notebook_config.py && \
echo 'c.NotebookApp.contents_manager_class = S3ContentsManager' >> /etc/jupyter/jupyter_notebook_config.py && \
echo 'c.S3ContentsManager.access_key_id = "xxxxxxxx"' >> /etc/jupyter/jupyter_notebook_config.py && \
echo 'c.S3ContentsManager.secret_access_key = "xxxxxxxx"' >> /etc/jupyter/jupyter_notebook_config.py && \
echo 'c.S3ContentsManager.bucket = "vishaljuypterhub"' >> /etc/jupyter/jupyter_notebook_config.py
Create the JupyterHub configuration file.
#JupyterHub Config
RUN echo "c = get_config()" >> /srv/jupyterhub/jupyterhub_config.py && \
echo "c.Spawner.env_keep = ['AWS_DEFAULT_REGION','AWS_EXECUTION_ENV','AWS_REGION','AWS_CONTAINER_CREDENTIALS_RELATIVE_URI','ECS_CONTAINER_METADATA_URI']" >> /srv/jupyterhub/jupyterhub_config.py && \
echo "c.Spawner.cmd = ['/opt/conda/bin/jupyterhub-singleuser']" >> /srv/jupyterhub/jupyterhub_config.py
Add PAM users
#Add PAM users
RUN useradd --create-home user3 && \
echo "user3:user3"|chpasswd && \
echo "export PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" >> /home/user3/.profile && \
mkdir -p /home/user3/.local/share/jupyter/kernels/ir && \
cp /root/.local/share/jupyter/kernels/ir/* /home/user3/.local/share/jupyter/kernels/ir/ && \
chown -R user3:user3 /home/user3
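The second Dockerfile (for the non-PII image) adds its own set of users in the same way. As an illustrative sketch, the equivalent block for user4 (one of the non-PII users used in the Test section) would look like this:
#Add PAM users (non-PII image)
RUN useradd --create-home user4 && \
echo "user4:user4"|chpasswd && \
echo "export PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" >> /home/user4/.profile && \
mkdir -p /home/user4/.local/share/jupyter/kernels/ir && \
cp /root/.local/share/jupyter/kernels/ir/* /home/user4/.local/share/jupyter/kernels/ir/ && \
chown -R user4:user4 /home/user4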
Start JupyterHub using configuration file created earlier.
## Start jupyterhub using config file
CMD ["jupyterhub","-f","/srv/jupyterhub/jupyterhub_config.py"]
Implementation of Web Based Environment
1a. Create IAM Roles and Policies
Create an IAM role and policy with access to the PII key/folder, and another with access to the non-PII key/folder.
aws iam create-role --role-name "pii" --description "Allows ECS tasks to call AWS services on your behalf." --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":["ecs-tasks.amazonaws.com"]},"Action":"sts:AssumeRole"}]}' --region us-east-1
aws iam put-role-policy --policy-name "pii" --policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"VisualEditor0","Effect":"Allow","Action":["s3:PutAccountPublicAccessBlock","s3:GetAccountPublicAccessBlock","s3:ListAllMyBuckets","s3:HeadBucket"],"Resource":"*"},{"Sid":"VisualEditor1","Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::vishaldatalake/pii/*"}]}' --role-name "pii" --region us-east-1
aws iam create-role --role-name "nonpii" --description "Allows ECS tasks to call AWS services on your behalf." --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":["ecs-tasks.amazonaws.com"]},"Action":"sts:AssumeRole"}]}' --region us-east-1
aws iam put-role-policy --policy-name "nonpii" --policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"VisualEditor0","Effect":"Allow","Action":["s3:PutAccountPublicAccessBlock","s3:GetAccountPublicAccessBlock","s3:ListAllMyBuckets","s3:HeadBucket"],"Resource":"*"},{"Sid":"VisualEditor1","Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::vishaldatalake/nonpii/*"}]}' --role-name "nonpii" --region us-east-1
2c, 2d. Build Docker images and push them to ECR
For the sake of brevity, I will skip the CodeCommit and CodeBuild setup and show the commands CodeBuild has to run. There will be two images: one with the users who need access to PII data and another for non-PII. Instead of two repositories, you can also create a single repository and build two images with different tags.
cd jupyterhub1
aws ecr create-repository --repository-name jupyterhub/test1
# get-login returns a docker login command; execute it to authenticate to ECR
$(aws ecr get-login --no-include-email --region us-east-1)
docker build -t jupyterhub/test1:latest .
docker tag jupyterhub/test1:latest xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test1:latest
docker push xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test1:latest
cd ../jupyterhub2
aws ecr create-repository --repository-name jupyterhub/test2
$(aws ecr get-login --no-include-email --region us-east-1)
docker build -t jupyterhub/test2:latest .
docker tag jupyterhub/test2:latest xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test2:latest
docker push xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test2:latest

3a. Create ECS Cluster
Go to the ECS service and click on Create Cluster, then choose Networking only. Click on Next Step, provide a cluster name, and click on Create. I have named the cluster jhpoc.
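If you prefer the CLI, the equivalent call for a Networking only (Fargate) cluster is:
aws ecs create-cluster --cluster-name jhpoc --region us-east-1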

3b. Create Task Definitions
Click on Task Definitions, then Create new Task Definition. Select Fargate as the launch type compatibility and click on Next Step. Enter the following details:
Task Definition Name: jhpocpii
Task Role: pii
Task Memory: 2GB
Task CPU: 1 vCPU
Click on Add container and enter the following details:
Container Name: jhpocpii
Image*: xxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test1:latest
Port Mapping: Add 8000 and 80
Click on Add.
Click on Create.
Follow the same steps to create another task definition for non-PII, using the nonpii role and the test2 image for the container.
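For reference, here is a hedged sketch of the equivalent register-task-definition call for the PII task; the account ID, role ARNs, and the ecsTaskExecutionRole execution role are placeholders you would replace with your own values:
aws ecs register-task-definition --family "jhpocpii" --requires-compatibilities "FARGATE" --network-mode "awsvpc" --cpu "1024" --memory "2048" --task-role-arn "arn:aws:iam::xxxxxxxxx:role/pii" --execution-role-arn "arn:aws:iam::xxxxxxxxx:role/ecsTaskExecutionRole" --container-definitions '[{"name":"jhpocpii","image":"xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/jupyterhub/test1:latest","portMappings":[{"containerPort":8000},{"containerPort":80}],"essential":true}]' --region us-east-1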

3d. Create Application Load Balancer.
Create Target groups.
aws elbv2 create-target-group --health-check-interval-seconds 30 --health-check-path "/hub/login" --health-check-protocol "HTTP" --health-check-timeout-seconds 5 --healthy-threshold-count 5 --matcher '{"HttpCode":"200"}' --name "jhpocpii" --port 8000 --protocol "HTTP" --target-type "ip" --unhealthy-threshold-count 2 --vpc-id "vpc-0829259f1492b8986" --region us-east-1
aws elbv2 create-target-group --health-check-interval-seconds 30 --health-check-path "/hub/login" --health-check-protocol "HTTP" --health-check-timeout-seconds 5 --healthy-threshold-count 5 --matcher '{"HttpCode":"200"}' --name "jhpocnonpii" --port 8000 --protocol "HTTP" --target-type "ip" --unhealthy-threshold-count 2 --vpc-id "vpc-0829259f1492b8986" --region us-east-1
Create ALB.
aws elbv2 create-load-balancer --name "jhpocalb1" --scheme "internet-facing" --security-groups '["sg-065547ed77ac48d99"]' --subnets '["subnet-0c90f68bfcc784540","subnet-026d9b30457fcb121"]' --ip-address-type "ipv4" --type "application" --region us-east-1
Create Routing Rules as follows:
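Each rule routes a hostname to the corresponding service's target group. Below is a hedged CLI sketch, assuming host-based routing to the two hostnames used in the Test section (the load balancer, listener, and target group ARNs are placeholders returned by the commands above):
aws elbv2 create-listener --load-balancer-arn <jhpocalb1-arn> --protocol HTTP --port 8000 --default-actions Type=forward,TargetGroupArn=<jhpocpii-tg-arn> --region us-east-1
aws elbv2 create-rule --listener-arn <listener-arn> --priority 1 --conditions Field=host-header,Values=jhpocpii.vishalcloud.club --actions Type=forward,TargetGroupArn=<jhpocpii-tg-arn> --region us-east-1
aws elbv2 create-rule --listener-arn <listener-arn> --priority 2 --conditions Field=host-header,Values=jhpocnonpii.vishalcloud.club --actions Type=forward,TargetGroupArn=<jhpocnonpii-tg-arn> --region us-east-1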

3c. Create ECS Services
Click on the ECS cluster, and on the Services tab, click on Create.
Choose Launch Type Fargate.
Choose Task Definition for PII.
Specify Service Name.
Specify Number of Tasks as 1.
Click on Next and uncheck Enable Service Discovery Integration.
Choose VPC, subnets and security group.
For Load Balancer choose ALB.
Choose Load Balancer Name from Drop down.
Choose Container to Load Balancer settings as follows:

Click on Next Step.
Optionally set Auto Scaling as follows:

Click on Create Service.
Repeat these steps so that there is one service for PII and one for non-PII.
After both services are created, wait a few minutes until there is one task running for each service.
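Here is a hedged sketch of the equivalent create-service call for the PII service (the target group ARN is a placeholder; the subnets and security group are the ones used for the ALB above, and the non-PII service is identical apart from the names):
aws ecs create-service --cluster "jhpoc" --service-name "jhpocpii" --task-definition "jhpocpii" --desired-count 1 --launch-type "FARGATE" --network-configuration "awsvpcConfiguration={subnets=[subnet-0c90f68bfcc784540,subnet-026d9b30457fcb121],securityGroups=[sg-065547ed77ac48d99],assignPublicIp=ENABLED}" --load-balancers "targetGroupArn=<jhpocpii-tg-arn>,containerName=jhpocpii,containerPort=8000" --region us-east-1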

3f. Create A-records.
Create A-records in Route53.
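A hedged example for one of the alias records via the CLI; the hosted zone ID of the domain and the ALB's DNS name and canonical hosted zone ID (available from describe-load-balancers) are placeholders:
aws route53 change-resource-record-sets --hosted-zone-id <vishalcloud-club-zone-id> --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"jhpocpii.vishalcloud.club","Type":"A","AliasTarget":{"HostedZoneId":"<alb-canonical-zone-id>","DNSName":"<jhpocalb1-dns-name>","EvaluateTargetHealth":false}}}]}'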

Test it
Launch jhpocpii.vishalcloud.club:8000. Log in as user1 and notice that user1 can only access PII data. Trying to log in as user3 or user4 will result in an authentication error.




Launch jhpocnonpii.vishalcloud.club:8000. Log in as user4 and notice that user4 can only access non-PII data. Trying to log in as user1 or user2 will result in an authentication error.




Test an R program.

Important Additional Considerations
Cost
ECS tasks can be launched using Fargate or EC2. The matrix below shows a cost comparison of similar CPU/memory configurations between Fargate and EC2. To save cost, choose based on the environment's usage pattern: use EC2 for persistent usage and Fargate for ad-hoc usage.

Security
Use certificates and consider launching the ECS tasks in a private subnet for security reasons.
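For example, with a certificate in ACM, TLS can be terminated at the ALB by adding an HTTPS listener (a hedged sketch; the certificate and other ARNs are placeholders):
aws elbv2 create-listener --load-balancer-arn <jhpocalb1-arn> --protocol HTTPS --port 443 --certificates CertificateArn=<acm-certificate-arn> --default-actions Type=forward,TargetGroupArn=<jhpocpii-tg-arn> --region us-east-1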
Active Directory Integration
You can use the LDAP authenticator to authenticate users through AD. Create separate images tied to different AD groups to control authorization.
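A hedged sketch of what that could look like in the Dockerfile, using the jupyterhub-ldapauthenticator package and the same echo-into-config style as above (the server address and bind DN template are placeholders for your AD environment):
#LDAP/AD authentication (replaces the PAM users)
RUN pip install jupyterhub-ldapauthenticator && \
echo "c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'" >> /srv/jupyterhub/jupyterhub_config.py && \
echo "c.LDAPAuthenticator.server_address = 'ad.example.com'" >> /srv/jupyterhub/jupyterhub_config.py && \
echo "c.LDAPAuthenticator.bind_dn_template = ['uid={username},ou=users,dc=example,dc=com']" >> /srv/jupyterhub/jupyterhub_config.py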