Pan-Canadian AI Compute Environment Clusters¶
The PAICE (Pan-Canadian AI Compute Environment) clusters are part of the Digital Research Alliance of Canada (DRAC) infrastructure, dedicated exclusively to AI research and reserved for CIFAR AI Chairs. Unlike general DRAC clusters, PAICE clusters use AIP (Artificial Intelligence Project) allocations and follow a tiered access model.
Note
When publishing research using the Pan-Canadian AI Compute Environment (PAICE), acknowledge the Digital Research Alliance of Canada, your specific regional partner and the AI institute that manages your cluster.
Account creation¶
PAICE accounts are hosted on the DRAC CCDB portal. Account creation follows the same steps as for other DRAC clusters. See the DRAC Clusters guide for the full process.
As a Mila researcher, you can request access of all the resources in the Artificial Intelligence tabs in Resources > Access Systems at CCDB.
Access to AIP allocation¶
Before you will be able to submit jobs on PAICE clusters, your professor must add you to their AIP (Artificial Intelligence Project) allocation. To be added, share your CCRI.
Connect to the clusters¶
Connecting to PAICE clusters follows the same steps as DRAC clusters, including setting up SSH keys and configuring multifactor authentication on the CCDB portal. See the DRAC Clusters guide for instructions.
Renewal¶
Account renewal follows the same annual process as other DRAC clusters. See the DRAC Clusters guide for instructions.
Clusters¶
These clusters use AIP allocations (--account=aip-${PI_NAME}), where
${PI_NAME} is the name of your supervising professor. Regular DRAC
allocations and the Mila global allocation will not work.
Access priority is distributed across three tiers based on supervisor affiliation and researcher location. By default, tier 1 and tier 2 together receive 85% of each cluster's resources, while tier 3 receives 10%. Contact your supervisor to determine which tier applies to your research group.
The table below provides information on the allocation depending of the default tier of a Mila researcher for the period which spans from April 7, 2026 to Spring 2027.
| Cluster | CPUs | RGUs allocated | # GPU equiv | Model | Unrestricted internet |
|---|---|---|---|---|---|
| TamIA tier1 + tier2 |
435 | 1738 | 143 | H100-80G H200 |
No |
| Killarney tier3 |
0 | 794 | 75 | H100-80G L40S |
Yes |
| Vulcan tier3 |
0 | 850 | 82 | L40S | No |
Check the current status of the clusters on the DRAC status page.
TamIA¶
Digital Research Alliance of Canada doc
Cluster managed by Mila and Calcul QuΓ©bec, located at UniversitΓ© Laval. Compute resources in TamIA are not assigned to jobs on a per-CPU, but on a per-node basis. No internet access on compute nodes.
Killarney¶
Digital Research Alliance of Canada doc
Cluster managed by Vector and SciNet, located at the University of Toronto.
Vulcan¶
Digital Research Alliance of Canada doc
Cluster managed by the University of Alberta and AMII, located at the University of Alberta. No internet access on compute nodes.
Launching jobs¶
Users must specify the AIP allocation using the flag
--account=aip-${PI_NAME}, where ${PI_NAME} is the name of the
supervising professor. To launch a CPU-only job:
To launch a GPU job:
To get an interactive session:
TamIA: per-node allocation
On TamIA, compute resources are allocated on a per-node basis, not per-CPU. Refer to the TamIA documentation for node specifications and submission guidelines.
The full documentation for job launching on Alliance clusters can be found here.
Storage¶
| Storage | Path | Usage |
|---|---|---|
$HOME |
/home/<user>/ |
Code, specific libraries |
$HOME/projects |
/project/<project> |
Compressed raw datasets |
$SCRATCH |
/scratch/<user> |
Processed datasets, experimental results, logs of experiments |
$SLURM_TMPDIR |
(on compute node) | Temporary job data or results |
When a series of experiments is finished, results should be transferred back to Mila servers.
More details on storage can be found on the DRAC Clusters guide or on DRAC wiki.
Using CometML and Wandb¶
Some compute nodes don't have access to the internet, but there is a special module that can be loaded in order to allow training scripts to access some specific servers, which includes the necessary servers for using CometML and Wandb ("Weights and Biases").
More documentation about this can be found here.
Note
Be careful when using Wandb with httpproxy. It does not support sending
artifacts and wandb's logger will hang in the background when your training
is completed, wasting resources until the job times out. It is recommended
to use the offline mode with wandb instead to avoid such waste.