Introduction
The Inria Grenoble centre cluster can meet limited computing needs. This service is open to Inria users, with priority access to some of the machines for the research teams that funded them.
The tools for using the Cluster are:
- OAR for computer reservation
- Monika for monitoring (viewing the state of the machines)
- Drawgantt to view the machine usage schedule (be patient)
- Singularity to run an application with your own environment
- For simple use cases, you can use conda in your environment; it is presented in the next section
The list of machine types is available here. The machine names are suffixed with the name of the sponsoring team (which has priority on them); otherwise the suffix is cp.
The old DSI documentation is Cluster de Centre (DSI / Teams)
Getting started with the INRIA Grenoble cluster
Connect to the front end
To be able to connect to the front end, you must add an SSH public key in your home directory ($HOME) while on bastion or on your team infrastructure.
Generate SSH keys
On your personal machine:
ssh-keygen -t rsa -b 3072 -f ~/.ssh/id_rsa
Do not leave the passphrase field empty. You need to protect your private key with a password for security reasons.
This will generate a private key (~/.ssh/id_rsa) and a public key (~/.ssh/id_rsa.pub).
- On your workstation at Inria, you need to place the previously generated public key under /home/USERNAME/.ssh/id_rsa.pub. To do so, you can send your public key to yourself via email or use a USB stick. If you are not physically at Inria, you can contact your system administrators to do it for you.
Do not send the private key via email or put it on a USB stick! This is not safe. Only your public key needs to be moved.
Then, on your Inria workstation, append the public key to your authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
At this point, you should be able to execute the following command on your personal machine to connect to bastion.
ssh -i ~/.ssh/id_rsa <username>@bastion.inrialpes.fr
SSH configuration
You can use an SSH configuration file to make this easier. To do so, create or update the file $HOME/.ssh/config on your personal machine with the following lines:
Host bastion
HostName bastion.inrialpes.fr
User <username>
ProxyCommand none
IdentityFile ~/.ssh/id_rsa
Host *.inrialpes.fr
User <username>
ProxyCommand ssh -W %h:22 bastion
ForwardX11 yes
IdentityFile ~/.ssh/id_rsa
Host access*-cp
ProxyCommand ssh <username>@bastion.inrialpes.fr "/usr/bin/nc %h %p"
Replace the occurrences of <username> with your Inria username. You can remove the ForwardX11 line if you do not need a graphical interface. Now you should be able to connect simply by typing `ssh bastion`.
After these steps, you can connect to one of the cluster front ends. The cluster is accessed from the front ends access1-cp.inrialpes.fr (Fedora) or access2-cp.inrialpes.fr (Ubuntu):
$ ssh access2-cp.inrialpes.fr
###################################################
## BASTION SSH de l'INRIA Rhône-Alpes ##
## en cas de probleme merci soumettre un ticket ##
## sur https://helpdesk.inria.fr ##
###################################################
The authenticity of host 'access2-cp.inrialpes.fr (<no hostip for proxy command>)' can not be established.
ECDSA key fingerprint is SHA256:hJTGfwFZvC/8Aw1eQ6KuOjf3IeEbki6jxTsjTLzml48.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'access2-cp.inrialpes.fr' (ECDSA) to the list of known hosts.
Last login: Tue May 24 14:03:30 2022 from 194.199.18.129
your_login@access2-cp:~$
Data storage
There are two different storage spaces:
- your home directory, which has only 10 GB of space.
- a shared scratch space accessible to the whole team.
You should not store data in your home directory since it has limited space; keep it for your most important files.
The scratch is available at /services/scratch/TEAM_NAME/USERNAME
If your team folder does not exist, you can create a ticket on the helpdesk to have it created.
If your personal folder does not exist, you should be able to create it yourself; you can also ask your CMI to create it.
The scratch is the preferred place to store data linked to experiments.
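For example, a minimal sketch for creating your personal folder (replace TEAM_NAME and USERNAME with your own values):
mkdir -p /services/scratch/TEAM_NAME/USERNAME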
Execution environment
Unless you have specific system dependencies, the recommended way to run your jobs on the cluster is to use conda. (Otherwise you should use Singularity or Apptainer. If you need to use containerization and have never done so, ask for help from someone familiar with this kind of technology, because the learning curve can be really steep.)
To be able to use conda, you need to install it in /services/scratch/TEAM_NAME/USERNAME. Sharing a conda installation is probably doable, but you will probably need to modify some permissions.
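If you do share an installation, a minimal sketch of the kind of permission change that might be needed (the path and the idea of granting group read access are assumptions, adapt them to your team setup):
chmod -R g+rX /services/scratch/TEAM_NAME/USERNAME/miniconda3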
To install conda, follow this procedure:
- Go to your home directory or to your scratch directory and download the Miniconda installer:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Then execute the following command:
chmod +x Miniconda3-latest-Linux-x86_64.sh && ./Miniconda3-latest-Linux-x86_64.sh
When asked about the installation location, put it in your scratch space: /services/scratch/TEAM_NAME/USERNAME/miniconda3. Then accept running conda init.
Source your ~/.bashrc for the changes to take effect: source ~/.bashrc
Congratulations, now you can create Python environments for your projects!
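For instance, a quick sketch of creating and using an environment (the environment name and versions are only examples):
conda create -n myenv python=3.10
conda activate myenv
conda install numpy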
Run your first job
The scheduler used to handle the resources is OAR.
You can run a job with the following command:
oarsub -I
It will run an interactive job.
Different parameters allow you to specify what kind of machine you want. Two kinds of parameters exist: resource requests set with -l and property filters set with -p.
You can specify a host with:
oarsub -I -p "cluster='thoth' AND host='node3-thoth.inrialpes.fr'"
You have to specify the cluster when targeting a specific node. You can specify multiple nodes and clusters with a combination of OR and AND.
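For instance, a sketch selecting either of two hosts of the same cluster (the second node name is only illustrative):
oarsub -I -p "cluster='thoth' AND (host='node3-thoth.inrialpes.fr' OR host='node4-thoth.inrialpes.fr')"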
You can change the length of a job:
oarsub -l "walltime=48:0:0" "/path/to/my/script.sh" # launch a job for 48 hours
Ask for 2 GPUs on a single node:
oarsub -l "/host=1/gpudevice=2"
Ask for 2 GPUs on each of 2 nodes:
oarsub -l "/host=2/gpudevice=2"
You can also request, for example, a number of cores, a minimum amount of RAM, and so on.
The parameter names correspond to the property names shown in Monika.
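For example, a sketch combining a resource request and a cluster property (the values are only illustrative; check Monika for the exact properties available on each node):
oarsub -l "/host=1/core=8,walltime=24:0:0" -p "cluster='thoth'" "/path/to/my/script.sh"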
Almost every node has been funded by a specific team; the funding teams have priority on the nodes they funded and, as a consequence, can kill any besteffort job running on them.
If you are not in a team that has funded nodes, your jobs will by default be launched in the besteffort queue as idempotent jobs. Besteffort is the lowest-priority queue, and idempotent means that your job will be relaunched if it gets killed by a default-queue job.
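To submit explicitly in this mode, a sketch would be:
oarsub -t besteffort -t idempotent -l "walltime=24:0:0" "/path/to/my/script.sh"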
You can connect to your running job with:
oarsub -C **JOB_ID**
You can delete your running job with:
oardel **JOB_ID**
You can get extra information concerning a job with
oarstat -f -j **JOB_ID**
You can show currently running and waiting jobs for a user with
oarstat -u **USERNAME**
Example: run a script.sh on 4 nodes, using 16 cores on each node, on any cluster, with besteffort, for at most 30 minutes:
$ oarsub -l /nodes=4/core=16,walltime=00:30:00 -p "cluster='SIC' OR cluster='nanod' OR cluster='mistis' OR cluster='kinovis' OR cluster='beagle' OR cluster='perception' OR cluster='thoth'" /services/scratch/morpheo/kinovis/script.sh
The cluster schedule can be monitored at:
http://visu-cp.inrialpes.fr/monika
http://visu-cp.inrialpes.fr/drawgantt
You need to be connected to the VPN.
For more information
For a more elaborate use of the cluster, you need to know a bit more about some of the tools.
OAR
OAR commands:
| Cmd | Options | Argument(s) | Description |
| --- | --- | --- | --- |
| oarstat | | | Shows currently running and waiting jobs, and who they belong to |
| oarsub | -I | Script or executable | Interactive mode: connects to a slave node and opens a shell for at most 2 hours |
| oarsub | -t besteffort | | Runs as besteffort; the job can be killed by a higher-privileged job |
| oarsub | -l /nodes=x/cpu=x/core=x,walltime=xx:xx:xx | | Number of nodes/CPUs/cores and duration requested for the job |
| oarsub | -p "cluster='SIC' [OR cluster='name']" | | Choose the cluster(s) to run on. Can be SIC, kinovis, nanod, beagle, mistis, perception, thoth. Besteffort on other teams' clusters |
| oarsub | -C | JOB_ID | Connects to the job |
| oardel | | JOB_ID | Kills the job |
The standard output and error are redirected to OAR.JOB_ID.stdout and OAR.JOB_ID.stderr.
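For example, to follow the output of a running job (the job id is only illustrative):
tail -f OAR.1234567.stdout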
Singularity or Apptainer
Docs
- Introduction to containerization with Singularity (in English)
- A tutorial from IN2P3 to start (in French).
- "Singularity has joined the Linux Foundation and is now Apptainer!"
Converting a Docker image to singularity/apptainer
To perform the conversion, at least Singularity 3.0 or Apptainer v1 is required. There are several ways to convert (you can run a Docker image or a Docker container directly, but this can produce errors): the best is to build a dedicated Singularity image from a Docker tar archive.
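If you do not already have the tar archive, it can be exported from Docker first (the image name is only an example):
docker save -o my_image_docker_tar.tar my_image:latest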
WARNING: you must be the root user to build from a Singularity recipe file, so singularity must be added to the sudoers configuration with SETENV to keep the Singularity environment variables:
$ vi /etc/sudoers.d/singularity
loginx ALL=(root) NOPASSWD:SETENV: /usr/local/bin/singularity
loginy ALL=(root) NOPASSWD:SETENV: /usr/local/bin/singularity
$ sudo -E singularity build ./my_image_singularity.img docker-archive:./my_image_docker_tar.tar
-> /root is still used for the cache despite invoking sudo with -E, which should tell it to use our environment variables!
$ sudo SINGULARITY_CACHEDIR=/scratch/loginx/singularity/ SINGULARITY_TMPDIR=/scratch/loginx/singularity/ singularity build ./my_image_singularity.img docker-archive:./my_image_docker_tar.tar
This finally did the job and created a Singularity image from the Docker one.
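Once built, a minimal sketch of running something inside the image on a node (--nv enables NVIDIA GPU support; the script name is only an example):
singularity exec --nv ./my_image_singularity.img python my_script.py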