Computing infrastructure and policies

This section provides factual information and policies about the Mila cluster computing environments.

Roles and authorizations

There are two main researcher statuses at Mila:

  1. Core researchers

  2. Affiliated researchers

This is determined by Mila policy. Core researchers have access to the Mila computing cluster. Check your supervisor’s Mila status to determine your own status.

Overview of available computing resources at Mila

The Mila cluster is to be used for regular development and relatively small numbers of jobs (< 5). It is a heterogeneous cluster and uses SLURM to schedule jobs.
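As a rough sketch, a batch job on a SLURM cluster is described by a small script of #SBATCH directives and submitted with sbatch. The resource values and names below are placeholders, not Mila-specific defaults:

#!/bin/bash
#SBATCH --job-name=myexp         # placeholder job name shown in squeue
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --cpus-per-task=4        # CPU cores for the job
#SBATCH --mem=16G                # CPU RAM for the job
#SBATCH --time=02:00:00          # wall-clock time limit

# Activate your environment here (e.g. conda activate <env>), then run your code:
python train.py

Submit the script with sbatch job.sh and check its state with squeue -u $USER.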

Mila cluster versus Compute Canada clusters

There are many commonalities between the Mila cluster and the clusters from Compute Canada (CC). At the time of writing, the CC clusters on which we have a large allocation of resources are beluga, cedar and graham. We also have comparable computational resources in the Mila cluster, with more to come.

The main distinguishing factor is that we have more control over our own cluster than we have over the ones at Compute Canada. Notably, the compute nodes in the Mila cluster all have unrestricted access to the Internet, which is not generally the case for CC clusters (although cedar does allow it).

At the time of this writing (June 2021), Mila students are advised to use a healthy mix of Mila and CC clusters. This is especially true when your favourite cluster is oversubscribed, because you can easily switch over to a different one if you are already used to it.

Guarantees about one GPU as absolute minimum

The Mila cluster tries to guarantee a minimum of one GPU per student, at all times, to be used in interactive mode. This is strictly better than “one GPU per student on average” because it is a floor: at any time, you should be able to ask for your GPU and get it right away (although it might take a minute for the request to be processed by SLURM).

Interactive sessions are possible on the CC clusters, and there are generally special rules that allow you to get resources more easily if you request them for a very short duration (for testing code before queueing long jobs). You do not get the same guarantee as on the Mila cluster, however.
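As a sketch, an interactive session with a single GPU can be requested in a similar way on either type of cluster. The resource values below are placeholders, and additional partition or account flags may be required depending on the cluster:

srun --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=1:00:00 --pty bash

Once the allocation is granted, you get a shell on a compute node; exiting the shell releases the resources.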

Node profile description

Special nodes and outliers

Power9

Power9 nodes use a different processor instruction set (ppc64le) than Intel and AMD (x86_64) based nodes. As such, you need to set up your environment again specifically for those nodes.

  • Power9 nodes have 128 hardware threads (2 processors × 16 cores × 4-way SMT).

  • 4 x V100 SXM2 (16 GB) with NVLink

  • In a Power9 node, GPUs and CPUs communicate with each other using NVLink instead of PCIe, which allows them to exchange data quickly. See Large Model Support (LMS) for more information.

Power9 nodes have the same software stack as the regular nodes, and everything needed to deploy your environment is available, just as on a regular node.
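As a sketch, you can confirm the architecture and keep a separate environment for it; the environment name and Python version below are arbitrary examples:

uname -m                                  # prints ppc64le on a Power9 node, x86_64 elsewhere
conda create -n py37-ppc64le python=3.7   # keep a separate env so x86_64 packages are not reused
conda activate py37-ppc64le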

AMD

Warning

As of August 20, 2019, the GPUs had to be returned to AMD. Mila will get more samples. You can join the AMD Slack channels to get the latest information.

Mila has a few nodes equipped with MI50 GPUs.

srun --gres=gpu -c 8 --reservation=AMD --pty bash

# First-time setup of the AMD ROCm stack
conda create -n rocm python=3.6
conda activate rocm

pip install tensorflow-rocm
pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
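As a quick sanity check that the ROCm stack actually sees the GPU (a sketch; the ROCm build of PyTorch exposes the device through the usual torch.cuda API):

python -c "import torch; print(torch.cuda.is_available())"   # expect True on a node with an MI50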

Data sharing policies

Note

/miniscratch aims to support Access Control Lists (ACLs) to allow collaborative work on rapidly changing data, e.g. work-in-progress datasets, model checkpoints, etc.

As of August 2021, licensing issues on our filesystem keep us from activating ACL functionality. To benefit from sharing functionality, please contact IT support, who will configure UNIX groups according to your needs.

/network/projects aims to offer a collaborative space for long-term projects. Data that should be kept for longer than 90 days can be stored in that location, but a request must first be made to Mila’s helpdesk to create the project directory.
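Once IT support has created a UNIX group for your project, a minimal sketch of sharing a project directory with that group could look like the following; the group and directory names are hypothetical:

chgrp -R myproject-group /network/projects/myproject   # hand the tree over to the (hypothetical) group
chmod -R g+rwX /network/projects/myproject             # let group members read and write
chmod g+s /network/projects/myproject                  # new files inherit the directory's group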

Monitoring

Every compute node on the Mila cluster has a Netdata monitoring daemon allowing you to get a sense of the state of the node. This information is exposed in two ways:

  • For every node, there is a web interface from Netdata itself at <node>.server.mila.quebec:19999. This is accessible only when using the Mila Wi-Fi or through SSH tunnelling (see the example below).

  • The Mila dashboard at dashboard.server.mila.quebec exposes aggregated statistics using Grafana. These are collected internally into an instance of Prometheus.

In both cases, those graphs are not editable by individual users, but they provide valuable insight into the state of the whole cluster or the individual nodes. One of the important uses is to collect data about the health of the Mila cluster and to sound the alarm if outages occur (e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM).
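For instance, a tunnel to a node’s Netdata interface could be opened as follows. This is only a sketch, assuming your SSH configuration defines a mila host alias for the login nodes and that you have a job on cn-c001:

ssh -L 19999:cn-c001.server.mila.quebec:19999 mila   # forward local port 19999 to the node
# then open http://localhost:19999 in your browser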

Example with Netdata on cn-c001

For example, if we have a job running on cn-c001, we can type cn-c001.server.mila.quebec:19999 in a browser address bar and the following page will appear.

monitoring.png

Example watching the CPU/RAM/GPU usage

Given that compute nodes are generally shared with other users who are running jobs and consuming resources at the same time, this is not generally a good way to profile your code in fine detail. However, it can still be a very useful source of information for getting an idea of whether the machine that you requested is being used to its full capacity.

Given how expensive the GPUs are, it generally makes sense to try to make sure that this resource is always kept busy.

  • CPU
    • iowait (pink line): High values mean your model is waiting a lot on I/O (disk or network).

monitoring_cpu.png
  • CPU RAM
    • You can see how much CPU RAM is being used by your script in practice, considering the amount that you requested (e.g. `sbatch --mem=8G ...`).

    • GPU usage is generally more important to monitor than CPU RAM. Still, you should not cut your memory request so close to the limit that your experiments randomly fail because they run out of RAM, nor should you blindly request 32GB of RAM when you actually require only 8GB.

monitoring_ram.png
  • GPU
    • Monitors the GPU usage using an nvidia-smi plugin for Netdata.

    • You should make sure you use the GPUs to their fullest capacity.

    • Select the biggest batch size if possible to increase GPU memory usage and the GPU computational load.

    • Spawn multiple experiments if you can fit many on a single GPU. Running 10 independent MNIST experiments on a single GPU will probably take less than 10x the time to run a single one. This assumes that you have more experiments to run, because nothing is gained by gratuitously running experiments.

    • You can request a less powerful GPU and leave the more powerful GPUs to other researchers whose experiments can make the best use of them. Sometimes you really just need a K80 and not a V100.

monitoring_gpu.png
  • Other users or jobs
    • If the node seems unresponsive or slow, it may be useful to check what other tasks are running at the same time on that node (see the sketch after this list). This should not be an issue in general, but in practice it is useful to be able to inspect this to diagnose certain problems.

monitoring_users.png
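As a complement to the graphs, the same information can be glanced at from the command line. This is only a sketch, assuming you have a job running on cn-c001 and that SSH access to nodes hosting your jobs is allowed:

squeue --nodelist=cn-c001     # list the jobs currently scheduled on that node
ssh cn-c001 nvidia-smi        # snapshot of GPU utilization and memory on that node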

Example with Mila dashboard

mila_dashboard_2021-06-15.png

Storage

Path                                 | Performance | Usage                                                                                            | Quota (Space/Files) | Auto-cleanup
$HOME or /home/mila/<u>/<username>/  | Low         | Personal user space; specific libraries, code, binaries                                          | 200G/1000K          |
/network/projects/<groupname>/       | Fair        | Shared space to facilitate collaboration between researchers; long-term project storage          | 200G/1000K          |
/network/data1/                      | High        | Raw datasets (read only)                                                                         |                     |
/network/datasets/                   | High        | Curated raw datasets (read only)                                                                 |                     |
/miniscratch/                        | High        | Temporary job results; processed datasets; optimized for small files; supports ACL to share data |                     | 90 days
$SLURM_TMPDIR                        | Highest     | High-speed disk for temporary job results                                                        | 4T/-                | at job end

  • $HOME is appropriate for code and libraries, which are small and read once, as well as experimental results that will be needed at a later time (e.g. the weights of a network referenced in a paper).

  • projects can be used for collaborative projects. It aims to ease the sharing of data between users working on a long-term project. It’s possible to request a bigger quota if the project requires it.

  • datasets contains curated datasets for the benefit of the Mila community. To request the addition of a dataset or a preprocessed dataset you think could benefit the research of others, you can fill out this form.

  • data1 should only contain compressed datasets. It is now deprecated and replaced by the datasets space.

  • miniscratch can be used to store processed datasets, work-in-progress datasets or temporary job results. Its block size is optimized for small files, which minimizes the performance hit of working on extracted datasets. It aims to support Access Control Lists (ACLs), which can be used to share data between users. This space is cleared weekly and files older than 90 days will be deleted. See the note on ACL availability above.

  • $SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy your data onto the node at the beginning of the job and to write intermediate checkpoints. This folder is cleared after each job (see the sketch after this list).
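A minimal sketch of the intended $SLURM_TMPDIR workflow inside a job script; the dataset, script and output names are placeholders:

cp /network/datasets/mydataset.tar $SLURM_TMPDIR/       # copy the data to the node-local disk
tar -xf $SLURM_TMPDIR/mydataset.tar -C $SLURM_TMPDIR/   # extract locally for fast reads
python train.py --data $SLURM_TMPDIR/mydataset --checkpoint-dir $SLURM_TMPDIR/checkpoints
cp -r $SLURM_TMPDIR/checkpoints $HOME/myexp/            # copy results back before the job ends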

Note

Auto-cleanup is applied to files that have not been read or modified during the specified period.

Warning

Currently there is no backup system for the Mila cluster. Storage local to personal computers, Google Drive and other related solutions should be used to back up important data.

Data Transmission

Multiple methods can be used to transfer data to/from the cluster:

  • rsync --bwlimit=10mb; this is the favoured method, since the bandwidth can be limited to prevent impacting other users of the cluster (see the example after this list)

  • scp

  • Compute Canada: Globus
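For example, a bandwidth-limited transfer of a local directory to your project space might look like this (a sketch; the mila SSH host alias and the destination path are assumptions):

rsync -avz --bwlimit=10mb ./results/ mila:/network/projects/<groupname>/results/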