Vishal Desai’s Oracle Blog

May 8, 2019

On-Demand Data Science JupyterHub Environment Using AWS Service Catalog

Filed under: AWS, Data Science — vishaldesai @ 5:13 pm

Overview

In a previous blog, I demonstrated how to create a web-based data science environment using JupyterHub on Elastic Container Service. That architecture requires a minimum of one running task, so even if the environment is not in continuous use, the customer still pays for one task per level of authorization they need. There are different ML stages, shown in the diagram below, and the customer wants to use stage/tool patterns 1 and 3 for their use cases. GPU-based instances are not economical to run on a persistent basis in the web-based architecture, so the customer wants a hybrid of web-based and on-demand environments. In this blog, I will demonstrate how data scientists can request such an on-demand environment using AWS Service Catalog.

image

Architecture

image

0a – Create a temporary Ubuntu 18.04 EC2 instance and install Anaconda, R, Python, etc. Create an AMI image from the instance, then terminate it.

1a – Create an IAM role with a policy that allows read access to PII data in the S3 data lake. This role will be used by the EC2 instances created when data scientists request an on-demand environment.

1b – Create a CloudFormation template using the IAM role and AMI image. This template allows only CPU-based EC2 instances, intended for data-exploration tasks.

1c – In Service Catalog, create a product from the CloudFormation template.

1d – In Service Catalog, create a PII portfolio and add the product to it.

1e – Create a CloudFormation template using the IAM role and AMI image. This template allows only GPU-based EC2 instances, intended for data exploration and for creating, training, and evaluating models.

1f – In Service Catalog, create a product from the CloudFormation template.

1g – Add the product to the existing PII portfolio.

1i – Add the IAM users who will work on PII data to the PII portfolio.

2a to 2i – Follow similar steps as above, mapped to the nonPII IAM role.

1h – Users can launch the products assigned to them.

1j – Once a product is launched, users can access the JupyterHub environment.

Implementation

0a. Create AMI

Launch any t2 EC2 instance using Ubuntu 18.04, log in to the instance, and run the following commands. Once the packages are installed, reboot the instance and create an AMI image from it. After the image is created, terminate the EC2 instance.

# Ubuntu updates
sudo apt-get update -y
sudo apt-get dist-upgrade -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y 

# Install Anaconda
sudo curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
sudo sh ./Anaconda3-5.2.0-Linux-x86_64.sh -b -p /home/ubuntu/anaconda3

# Install NFS client
sudo apt update
sudo apt-get install nfs-common -y

# Install R pre-requisites
sudo apt-get install -y --no-install-recommends \
fonts-dejavu \
unixodbc \
unixodbc-dev \
r-cran-rodbc \
gfortran \
gcc && \
rm -rf /var/lib/apt/lists/*


# Fix for devtools https://github.com/conda-forge/r-devtools-feedstock/issues/4
sudo ln -s /bin/tar /bin/gtar

# Install Conda, Jupyterhub and R packages
sudo su -
export PATH=$PATH:/home/ubuntu/anaconda3/bin
conda update -n base conda -y
conda create --name jupyter python=3.6 -y
source activate jupyter
conda install -c conda-forge jupyterhub -y
conda install -c conda-forge jupyter notebook -y
conda install -c r r-IRkernel -y
conda install -c r rstudio -y
conda install -c r/label/borked rstudio -y
conda install -c r r-devtools -y
conda install -c r r-ggplot2 r-dplyr -y
conda install -c plotly plotly -y
conda install -c plotly/label/test plotly -y
conda update curl -y
conda install -c bioconda bcftools -y
conda install -c bioconda/label/cf201901 bcftools -y
conda install -c anaconda boto3 -y
pip install boto3
R -e "devtools::install_github('IRkernel/IRkernel')"
R -e "IRkernel::installspec(user = FALSE)"

# Install Jupyterhub
#sudo python3 -m pip install jupyterhub

# Create Config file
mkdir /srv/jupyterhub
echo "c = get_config()" >> /srv/jupyterhub/jupyterhub_config.py
echo "c.Spawner.env_keep = ['AWS_DEFAULT_REGION','AWS_EXECUTION_ENV','AWS_REGION','AWS_CONTAINER_CREDENTIALS_RELATIVE_URI','ECS_CONTAINER_METADATA_URI']" >> /srv/jupyterhub/jupyterhub_config.py
echo "c.Spawner.cmd = ['/home/ubuntu/anaconda3/envs/jupyter/bin/jupyterhub-singleuser']" >> /srv/jupyterhub/jupyterhub_config.py
  

1a. Create IAM roles and policies.

aws iam create-role --role-name "jupyterpii" --description "Allows EC2 to call AWS services on your behalf." --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":["ec2.amazonaws.com"]},"Action":"sts:AssumeRole"}]}' --region us-east-1
aws iam put-role-policy --policy-name "pii" --policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"VisualEditor0","Effect":"Allow","Action":["s3:PutAccountPublicAccessBlock","s3:GetAccountPublicAccessBlock","s3:ListAllMyBuckets","s3:HeadBucket"],"Resource":"*"},{"Sid":"VisualEditor1","Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::vishaldatalake/pii/*"}]}' --role-name "jupyterpii" --region us-east-1
aws iam create-role --role-name "jupyternonpii" --description "Allows EC2 to call AWS services on your behalf." --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"","Effect":"Allow","Principal":{"Service":["ec2.amazonaws.com"]},"Action":"sts:AssumeRole"}]}' --region us-east-1
aws iam put-role-policy --policy-name "nonpii" --policy-document '{"Version":"2012-10-17","Statement":[{"Sid":"VisualEditor0","Effect":"Allow","Action":["s3:PutAccountPublicAccessBlock","s3:GetAccountPublicAccessBlock","s3:ListAllMyBuckets","s3:HeadBucket"],"Resource":"*"},{"Sid":"VisualEditor1","Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::vishaldatalake/nonpii/*"}]}' --role-name "jupyternonpii" --region us-east-1
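
An EC2 instance can only assume a role through an instance profile. If your CloudFormation template does not create the instance profiles itself, here is a sketch of wiring them up manually; the profile names simply mirror the role names:

aws iam create-instance-profile --instance-profile-name "jupyterpii" --region us-east-1
aws iam add-role-to-instance-profile --instance-profile-name "jupyterpii" --role-name "jupyterpii" --region us-east-1
aws iam create-instance-profile --instance-profile-name "jupyternonpii" --region us-east-1
aws iam add-role-to-instance-profile --instance-profile-name "jupyternonpii" --role-name "jupyternonpii" --region us-east-1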

1b, 1e Create CloudFormation templates.

Create an EFS file system and replace the EFS endpoint in the templates with your own. All notebooks will be stored on the shared EFS mount point. Review the default parameters and adjust them for your environment.

Templates
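
The full templates are linked above; as a rough illustration, here is a trimmed sketch of what the CPU-only variant can look like. Resource names, instance types, and the port are illustrative, and the GPU variant would simply allow p2/p3 instance types instead:

# Trimmed, illustrative sketch of the CPU-only template (not the full template)
cat > jupyter-cpu-template.yaml <<'EOF'
AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative sketch of the CPU-only JupyterHub product
Parameters:
  InstanceType:
    Type: String
    Default: t3.large
    AllowedValues: [t3.medium, t3.large, t3.xlarge]  # CPU types only
  AmiId:
    Type: AWS::EC2::Image::Id    # AMI from step 0a
  EfsDns:
    Type: String                 # e.g. <file-system-id>.efs.us-east-1.amazonaws.com
Resources:
  JupyterInstance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref AmiId
      InstanceType: !Ref InstanceType
      IamInstanceProfile: jupyterpii   # instance profile from step 1a
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          mkdir -p /efs
          mount -t nfs4 ${EfsDns}:/ /efs
          /home/ubuntu/anaconda3/envs/jupyter/bin/jupyterhub -f /srv/jupyterhub/jupyterhub_config.py &
Outputs:
  JupyterHubURL:
    Value: !Sub 'http://${JupyterInstance.PublicDnsName}:8000'
EOF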

1c, 1f, 2c, 2f Create Service Catalog Products

Locate the Service Catalog service and click on Upload new product.

image

Click on Next. Enter the email contact of the product owner and click on Next.

image

Click on Choose file, select the CloudFormation template, and click on Next. Review the details and create the product.
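
If you prefer scripting over the console, the equivalent CLI call looks roughly like this; the product name, owner email, and the S3 URL hosting the template are illustrative:

aws servicecatalog create-product \
  --name "JupyterHub-PII-CPU" \
  --owner "owner@example.com" \
  --product-type CLOUD_FORMATION_TEMPLATE \
  --provisioning-artifact-parameters '{"Name":"v1","Info":{"LoadTemplateFromURL":"https://s3.amazonaws.com/mybucket/jupyter-cpu-template.yaml"},"Type":"CLOUD_FORMATION_TEMPLATE"}' \
  --region us-east-1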

Below is a screenshot of all the products.

image

1d, 1g, 2d, 2g Create Service Catalog portfolios and add products.

Click on Create portfolio.

image

Click on Create.

image

Click on each portfolio in turn and add the PII-specific products to the PII portfolio and the nonPII products to the nonPII portfolio.
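
The same pair of steps from the CLI, for reference; the product and portfolio IDs returned by the create calls are illustrative here:

aws servicecatalog create-portfolio --display-name "PII" --provider-name "vishaldesai" --region us-east-1
aws servicecatalog associate-product-with-portfolio --product-id "prod-pii1111111111" --portfolio-id "port-pii1111111111" --region us-east-1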

Below is a screenshot of the PII portfolio.

image

1i, 2i Create IAM users and give them access to the Service Catalog portfolios.

Create IAM users for data scientists.

# Note: the --password values below are illustrative placeholders
aws iam create-user --user-name "user1" --path "/" --region us-east-1
aws iam attach-user-policy --policy-arn "arn:aws:iam::aws:policy/AWSServiceCatalogEndUserFullAccess" --user-name "user1" --region us-east-1
aws iam attach-user-policy --policy-arn "arn:aws:iam::aws:policy/IAMUserChangePassword" --user-name "user1" --region us-east-1
aws iam create-login-profile --user-name "user1" --password 'ChangeMe#2019' --password-reset-required --region us-east-1

aws iam create-user --user-name "user2" --path "/" --region us-east-1
aws iam attach-user-policy --policy-arn "arn:aws:iam::aws:policy/AWSServiceCatalogEndUserFullAccess" --user-name "user2" --region us-east-1
aws iam attach-user-policy --policy-arn "arn:aws:iam::aws:policy/IAMUserChangePassword" --user-name "user2" --region us-east-1
aws iam create-login-profile --user-name "user2" --password 'ChangeMe#2019' --password-reset-required --region us-east-1
  

Click on the portfolio and, under Users, groups and roles, click on Add user, group or role.

image
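This console step corresponds to associate-principal-with-portfolio on the CLI; the account ID and portfolio ID below are illustrative:

aws servicecatalog associate-principal-with-portfolio \
  --portfolio-id "port-pii1111111111" \
  --principal-type IAM \
  --principal-arn "arn:aws:iam::123456789012:user/user1" \
  --region us-east-1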

1h, 2h Log in as a data scientist IAM user and launch the product.

image

Click on the product and launch it.

image

Provide the details and click on Next.

image

Change the instance type as needed and click on Next.

Leave the defaults for Tags and Notifications. Review the details and launch the product.

image

Once the product is launched, the outputs will show the JupyterHub URL as a key-value pair.
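
Behind the scenes the provisioned product is just a CloudFormation stack (named SC-&lt;account-id&gt;-pp-&lt;id&gt;), so the same output can also be read back from the CLI; the stack name below is illustrative:

aws cloudformation describe-stacks \
  --stack-name "SC-123456789012-pp-abcdefghij" \
  --query "Stacks[0].Outputs" \
  --region us-east-1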

Launch the JupyterHub URL in a browser.

image

Log in with your username and password.

1k, 2k Create a notebook and test access.

Create a notebook; it will be stored on the shared EFS mount point.

image

As expected, the user can access data from the PII folder.

image

The user does not have access to the nonPII data.

Once the user completes the data science or machine learning tasks, the product can be terminated by clicking on Actions and then Terminate. In the future, the user can launch the product again and the notebooks will be preserved, as they are stored on the persistent EFS mount.
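
Termination can also be scripted; the provisioned product name below is whatever the user chose at launch time:

aws servicecatalog terminate-provisioned-product \
  --provisioned-product-name "user1-jupyterhub" \
  --region us-east-1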

Additional Considerations

Spot Instances

I have created the products using CloudFormation templates that launch On-Demand Instances. If there is no urgency to complete data exploration or machine learning training, consider creating products that use Spot Instances, which can significantly reduce cost.
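
One way to do that, continuing the illustrative template sketch from earlier, is to move the instance onto a launch template that requests Spot capacity. This is a fragment to merge under Resources, not a complete template:

cat > spot-fragment.yaml <<'EOF'
  JupyterLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        InstanceMarketOptions:
          MarketType: spot   # interruptible, but notebooks persist on EFS
  # ...and reference it from JupyterInstance:
  #   LaunchTemplate:
  #     LaunchTemplateId: !Ref JupyterLaunchTemplate
  #     Version: !GetAtt JupyterLaunchTemplate.LatestVersionNumber
EOF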

Security

Use certificates, and for security reasons consider launching the EC2 products in a private subnet and accessing them through a bastion host.

Active Directory Integration

You can use the LDAP authenticator to authenticate users through Active Directory.
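
For example, with the jupyterhub-ldapauthenticator package; the server address and DN template below are placeholders for your AD setup:

# Inside the jupyter conda env, install the authenticator and append to the config
pip install jupyterhub-ldapauthenticator
echo "c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'" >> /srv/jupyterhub/jupyterhub_config.py
echo "c.LDAPAuthenticator.server_address = 'ldaps://ad.example.com'" >> /srv/jupyterhub/jupyterhub_config.py
echo "c.LDAPAuthenticator.bind_dn_template = ['uid={username},ou=people,dc=example,dc=com']" >> /srv/jupyterhub/jupyterhub_config.py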

SageMaker

The product offering can be extended with SageMaker, giving data scientists the flexibility to use JupyterHub or SageMaker depending on their requirements.

Cost and Reporting

If users don’t terminate the product, the EC2 instance will keep incurring cost. A Lambda function can be scheduled to terminate idle instances, or a CloudWatch alarm can be created so that EC2 instances idle for more than a certain period are terminated.
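
CloudWatch alarms can take EC2 actions directly, so a simple version needs no Lambda at all; the instance ID and idle thresholds below are illustrative (average CPU under 5% for four hours):

aws cloudwatch put-metric-alarm \
  --alarm-name "jupyter-idle-terminate" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 4 \
  --threshold 5 \
  --comparison-operator LessThanThreshold \
  --alarm-actions "arn:aws:automate:us-east-1:ec2:terminate" \
  --region us-east-1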

As users have control over the type of EC2 instances they launch, additional reporting can be built using Service Catalog, EC2, and cost metadata.
