How to set up and use an HPC Cluster to scale up ML experiments 🚀
This blog post is a guide to setting up and using an HPC cluster. This is the first article in a 3-part series about multi-node machine learning. You can also check out:
⚠️ Note: This article uses AWS ParallelCluster as an example of how to set up a cloud-based cluster.
A High Performance Computing (HPC) cluster is a collection of interconnected computers designed to handle computationally demanding tasks, and is thus often used for machine learning workloads.
While traditional on-premises HPC clusters have their own advantages, such as better network communication or more control over each component, they require significant investment in hardware and in the expertise needed to set up and maintain the infrastructure.
Cloud-based clusters, on the other hand, are easier to set up and require less maintenance. The main advantages are:
An HPC cluster usually has the following components:
You can create a new AWS ParallelCluster at no additional cost; you only pay for the resources needed to run it.
Install AWS ParallelCluster:
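A minimal way to do this, assuming a recent Python with pip available (installing into a virtual environment keeps it isolated):

```bash
# Create and activate a virtual environment (optional but keeps things tidy)
python3 -m venv ~/pcluster-env
source ~/pcluster-env/bin/activate

# Install the ParallelCluster CLI from PyPI
pip install --upgrade aws-parallelcluster
```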
Install nvm if applicable. Node.js is needed by the AWS Cloud Development Kit (AWS CDK), which ParallelCluster uses to define cloud infrastructure as code through CloudFormation. This lets you write a template that specifies the cluster configuration, which is then used to generate the cluster.
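For example (the nvm version below is a placeholder; check the nvm README for the current install command):

```bash
# Install nvm, then use it to install a recent LTS release of Node.js
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
source ~/.bashrc
nvm install --lts
node --version
```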
Verify that AWS ParallelCluster is installed correctly:
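Running the version command is enough to confirm the CLI is on your PATH:

```bash
pcluster version
```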
To create the cluster, you need to write a YAML config file with the desired specifications. You can choose different AWS EC2 instance types.
Below is an example of such a config file.
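This is a minimal sketch, assuming one GPU Slurm queue (named to match the node names used later in this post) and an FSx for Lustre shared filesystem; the region, subnet ID, key name, and instance types are placeholders to adjust for your own account:

```yaml
Region: eu-west-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: t3.medium
  Networking:
    SubnetId: subnet-xxxxxxxxxxxxxxxxx
  Ssh:
    KeyName: my-key
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      ComputeResources:
        - Name: g52xlarge
          InstanceType: g5.2xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxxxxxxx
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: SCRATCH_2
```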
⚠️ Note: It is important to specify the SubnetId under Networking to make sure the cluster isn’t visible to everyone externally. The different subnets can be seen under the Virtual Private Cloud section on AWS.
⚠️ Note: Make sure to check the pricing of each instance type, as well as the pricing of additional shared storage such as FSx for Lustre before trying it out on the cluster. If you just want to play around with it, you might want to opt for a queue with some general-purpose instances first (such as t2 or t3) and leave out the additional shared storage.
⚠️ Note: When configuring options for the FSx Lustre storage, be careful when picking between scratch and persistent file systems. While scratch file systems are more cost-effective and can be a good option for temporary storage, your data will be wiped out if a file server fails. You can find more details about this here.
Create the cluster by specifying the config file and the cluster name:
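With the ParallelCluster v3 CLI this looks like the following (the file and cluster names are placeholders):

```bash
pcluster create-cluster \
    --cluster-name my-cluster \
    --cluster-configuration cluster-config.yaml
```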
You can check the status of the cluster with:
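For example, using the same cluster name as above:

```bash
pcluster describe-cluster --cluster-name my-cluster
```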
Once the cluster status is CREATE_COMPLETE you can ssh into the head node with:
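For example, passing the private key that matches the KeyName from the config:

```bash
pcluster ssh --cluster-name my-cluster -i ~/.ssh/my-key.pem
```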
After you ssh in, the head node’s IP address should appear in your terminal prompt after your username:
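The exact prompt depends on the head node’s OS; on Amazon Linux it looks something like:

```
[ec2-user@ip-10-0-0-123 ~]$
```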
To be able to use ssh parallel-cluster directly, add your cluster details locally to ~/.ssh/config:
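A sketch of the entry; the IP, user, and key path are placeholders (the default user is ec2-user on Amazon Linux or ubuntu on Ubuntu-based images):

```
Host parallel-cluster
    HostName <head-node-ip>
    User ec2-user
    IdentityFile ~/.ssh/my-key.pem
```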
Once you’re logged in on the head node, run sinfo to verify that the compute nodes are set up and configured:
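With the example queue from earlier, the output looks roughly like this; the ~ suffix on the state means the cloud nodes are currently powered down and will be started on demand:

```
$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
gpu-queue*    up   infinite      4  idle~  gpu-queue-dy-g52xlarge-[1-4]
```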
You can also check GPUs are visible by running:
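One way to do this is to run nvidia-smi on a compute node through Slurm (the partition name is the queue name from the config):

```bash
srun --partition=gpu-queue --nodes=1 nvidia-smi
```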
First, ssh into the head node and create a bash script that takes a new user’s name and public ssh key, creates the user on the head node, and records the user name and UID in a userlistfile under /opt/parallelcluster/shared:
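A sketch of such a script, here called add_users.sh to match the note further down; the exact user and key handling is an assumption and may need adapting:

```bash
#!/bin/bash
# add_users.sh: create a user on the head node and record it in the
# shared userlistfile so compute nodes can recreate it.
# Usage: sudo ./add_users.sh <username> "<public-ssh-key>"
set -euo pipefail

USERNAME=$1
PUB_KEY=$2
USERLIST=/opt/parallelcluster/shared/userlistfile

# Create the user on the head node
useradd -m "$USERNAME"

# Install their public key so they can ssh in
mkdir -p /home/"$USERNAME"/.ssh
echo "$PUB_KEY" > /home/"$USERNAME"/.ssh/authorized_keys
chmod 700 /home/"$USERNAME"/.ssh
chmod 600 /home/"$USERNAME"/.ssh/authorized_keys
chown -R "$USERNAME":"$USERNAME" /home/"$USERNAME"/.ssh

# Record the username and UID in the shared filesystem
echo "$USERNAME,$(id -u "$USERNAME")" >> "$USERLIST"
```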
Then, back on your local machine, create a create-users.sh script with the following:
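A sketch of create-users.sh, assuming the comma-separated username,UID format written by the script above; it runs as root on each compute node when the node is configured:

```bash
#!/bin/bash
# create-users.sh: recreate every user from the shared userlistfile on
# this node, keeping the same UID so file ownership on the shared
# filesystem stays consistent.
set -euo pipefail

USERLIST=/opt/parallelcluster/shared/userlistfile

while IFS=, read -r username uid; do
    # Skip users that already exist on this node
    if ! id "$username" &> /dev/null; then
        useradd -m -u "$uid" "$username"
    fi
done < "$USERLIST"
```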
Upload the script to the S3 bucket created by ParallelCluster:
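The bucket name below is a placeholder; you can find the bucket ParallelCluster created (usually prefixed with parallelcluster-) with aws s3 ls:

```bash
aws s3 cp create-users.sh s3://<your-parallelcluster-bucket>/create-users.sh
```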
Update your cluster config so that each Slurm queue runs this script when a node is configured, and grant access to the above S3 bucket (you only need to do this once):
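In the cluster config, this is a sketch along the following lines for each queue (the bucket name is the same placeholder as in the previous step):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      CustomActions:
        OnNodeConfigured:
          Script: s3://<your-parallelcluster-bucket>/create-users.sh
      Iam:
        S3Access:
          - BucketName: <your-parallelcluster-bucket>
      # ... existing ComputeResources and Networking sections unchanged
```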
Update the cluster (make sure no jobs are running, as this will stop the compute fleet):
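With the v3 CLI, stop the compute fleet first and then apply the updated config:

```bash
# Stop the compute fleet so the queues can be updated
pcluster update-compute-fleet --cluster-name my-cluster --status STOP_REQUESTED

# Apply the updated configuration
pcluster update-cluster \
    --cluster-name my-cluster \
    --cluster-configuration cluster-config.yaml

# Restart the compute fleet once the update is complete
pcluster update-compute-fleet --cluster-name my-cluster --status START_REQUESTED
```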
Once the cluster is updated and running, the new user should be able to access the cluster with:
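For example, with their own private key and the head node address as placeholders:

```bash
ssh -i ~/.ssh/id_ed25519 new-user@<head-node-ip>
```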
⚠️ Note: Next time a user needs to be added you can just run the add_users.sh bash script from the first step (no need to update the cluster again).
Slurm accounting can be used to collect information for every job and job step executed.
Run the script below with sudo to install MariaDB and set up Slurm accounting:
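Here is a rough sketch of such a script, assuming an Amazon Linux 2 head node and the default /opt/slurm install paths; the database password is a placeholder and the slurmdbd settings are a minimal example rather than a production setup:

```bash
#!/bin/bash
# Sketch: install MariaDB and wire up Slurm accounting on the head node.
set -euo pipefail

SLURM_ETC=/opt/slurm/etc
DB_PASS=CHANGE_ME   # placeholder password

# Install and start MariaDB
yum install -y mariadb-server
systemctl enable --now mariadb

# Create the accounting database and a slurm DB user
mysql -e "CREATE DATABASE IF NOT EXISTS slurm_acct_db;"
mysql -e "CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY '${DB_PASS}';"
mysql -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost'; FLUSH PRIVILEGES;"

# Minimal slurmdbd configuration
cat > ${SLURM_ETC}/slurmdbd.conf <<EOF
DbdHost=localhost
DbdPort=6819
AuthType=auth/munge
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=${DB_PASS}
StorageLoc=slurm_acct_db
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
EOF
chown slurm:slurm ${SLURM_ETC}/slurmdbd.conf
chmod 600 ${SLURM_ETC}/slurmdbd.conf

# Point slurmctld at the accounting daemon
cat >> ${SLURM_ETC}/slurm.conf <<EOF
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819
EOF

# Start slurmdbd, register the cluster, and restart the controller
/opt/slurm/sbin/slurmdbd
sleep 5
CLUSTER_NAME=$(grep -i '^ClusterName' ${SLURM_ETC}/slurm.conf | cut -d= -f2)
/opt/slurm/bin/sacctmgr -i add cluster "${CLUSTER_NAME}"
systemctl restart slurmctld
```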
Check that accounting works by running sacct.
A usual workflow could consist of:
💡 Note: The head node and compute nodes share the same filesystem, so you will be able to access the same virtual environment/directories from the compute nodes as well.
To ssh directly into one of the compute nodes you can launch an interactive session with Slurm that will open a bash terminal on that node:
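For example, requesting one GPU on one node for an hour (the queue name and resource flags are placeholders):

```bash
srun --partition=gpu-queue --nodes=1 --gres=gpu:1 --time=01:00:00 --pty bash -i
```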
You can now run a debugger from the launched bash shell. If you have a remote cluster and an IDE that supports remote development (e.g. VS Code), you can add the new node as a host to your local ~/.ssh/config. To find the name of the node, run squeue, find your job, and check which node it is using, e.g. gpu-queue-dy-g52xlarge-1.
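A sketch of the config entry, assuming the parallel-cluster host alias from earlier is used as a jump host and the user and key are the same placeholders as before:

```
Host gpu-queue-dy-g52xlarge-1
    HostName gpu-queue-dy-g52xlarge-1
    User ec2-user
    ProxyJump parallel-cluster
    IdentityFile ~/.ssh/my-key.pem
```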
You should now be able to find the new host in your IDE remote explorer under ssh hosts and launch a new window from there.
💡 Note: Since the host key of the instance will change every time you launch an interactive session, you can add StrictHostKeyChecking=no and UserKnownHostsFile=/dev/null to the config to stop checking the host key or adding it to the known hosts file.
⚠️ Note: Once the time limit is reached, the node won’t be accessible anymore, so you’ll have to launch a new interactive bash session.
When you first try to use an HPC cluster to scale up your job you might feel quite excited to just sit back, relax, and watch as your code scales perfectly across 1,000 nodes.
What you often get instead is a sad, droopy curve that looks like it gave up on life halfway through.
One of the usual suspects to look out for when this happens is low network bandwidth. This means that the nodes are not exchanging information quickly enough, resulting in a communication overhead that can become a bottleneck for your job. Cloud-based clusters in particular can suffer from this, as the individual compute nodes are usually not located as close together as in an on-premises equivalent.
Nevertheless, there is hope. On AWS, one of the main ways to increase the network bandwidth between nodes is the Elastic Fabric Adapter (EFA), a network interface that provides lower latency and higher throughput than standard TCP-based networking.
⚠️ Note: Distributed ML packages like PyTorch’s torch.distributed use NVIDIA’s Collective Communications Library (NCCL) backend to implement multi-GPU and multi-node primitives (e.g. operations like all-gather or all-reduce). To make full use of EFA for ML tasks, you’ll need to make sure NCCL will work with it. At the time of writing, NCCL with EFA is supported only with p3dn.24xlarge, p4d.24xlarge, and p5.48xlarge instances.
If you have an instance type that supports EFA, you can enable it by adding it as an option to the cluster config and updating the cluster:
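In the cluster config this is a sketch along these lines, using one of the EFA-capable instance types mentioned above; enabling a placement group keeps the instances physically close to each other:

```yaml
SlurmQueues:
  - Name: gpu-queue
    Networking:
      SubnetIds:
        - subnet-xxxxxxxxxxxxxxxxx
      PlacementGroup:
        Enabled: true
    ComputeResources:
      - Name: p4d24xlarge
        InstanceType: p4d.24xlarge
        Efa:
          Enabled: true
```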
Once the cluster is updated, you can check that it works with NCCL by installing and running nccl-tests, which tests the performance and correctness of NCCL operations.
Here’s an example of testing the performance of the all-reduce operation by running all_reduce_perf:
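Below is a sketch of building and running it; the module names, MPI/CUDA paths, and node/GPU counts are assumptions to adapt to your cluster, and depending on how Slurm and MPI were built you may need --mpi=pmix or to launch via mpirun instead:

```bash
# Build nccl-tests on the head node
source /etc/profile.d/modules.sh
module load openmpi
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda

# Benchmark all-reduce on 2 nodes x 8 GPUs, sweeping message sizes
# from 8 bytes to 1 GB
srun --nodes=2 --ntasks-per-node=8 --gres=gpu:8 \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```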
💡 Note: Running module load X will load all relevant environment variables/packages. If module is not recognised, you might need to run source /etc/profile.d/modules.sh.
💡 Note: The all-reduce operation is a collective communication operation that is used to reduce data across multiple GPUs/nodes in distributed training. Some examples of reduce operations are: sum/min/max.
Containers are usually used to package applications and their dependencies together, so that the application will run the same way regardless of where it’s deployed. This makes them easy to share and reproduce across different stages of development, testing, and production.
Since containers usually need a privileged runtime (i.e. root access), you might wonder if it’s possible to use containers on the cluster. NVIDIA’s enroot tool gets around this limitation by using chroot(1) to create an unprivileged, isolated runtime environment for the container, limiting its access to a subset of the filesystem.
To actually run the containers, NVIDIA’s pyxis plugin for Slurm allows you to run them by specifying only the container URI, e.g. amazonlinux:latest.
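With pyxis installed, srun gains a --container-image flag; for example, running a command inside the image mentioned above:

```bash
srun --container-image=amazonlinux:latest cat /etc/os-release
```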
For a great tutorial on how to set up these tools on AWS Parallel Cluster, check out this blog post.
Hopefully by now, setting up a cluster and using it looks less intimidating. Check out Part 2 of this series about how to submit jobs with Slurm.