INRIA Grenoble Cluster Access – Usage

Introduction

The Inria Grenoble centre cluster can meet limited computing needs. This service is open to Inria users, with priority access to some machines for the research teams that funded them.

The tools for using the Cluster are:

  • OAR for resource reservation
  • Monika for monitoring (seeing the state of the machines)
  • Drawgantt for viewing the machine usage schedule (be patient)
  • Singularity for running an application with your own environment
  • for simple use cases, conda in your environment; it is presented in the Execution environment section below

The list of machine types is available here. Machine names are suffixed with the name of the sponsoring team (which therefore has priority on them); otherwise the suffix is cp.

The old DSI documentation is Cluster de Centre ( DSI / Teams )

Getting started with the INRIA Grenoble cluster

Connect to the front end

To be able to connect to the front end, you must add an SSH public key to your home directory ($HOME) as seen from bastion or from your team infrastructure.

Generate SSH keys

On your personal machine:

ssh-keygen -t rsa -b 3072 -f ~/.ssh/id_rsa

Do not leave the passphrase field empty: your private key must be protected with a passphrase for security reasons.

This will generate a private key (~/.ssh/id_rsa) and a public key (~/.ssh/id_rsa.pub).

  • On your workstation at Inria, you need to place the previously generated public key under /home/USERNAME/.ssh/id_rsa.pub. To do so, you can send the public key to yourself via email or use a USB stick. If you are not physically at Inria, you can ask your system administrators to do it for you.

Do not send the private key via email or put it on a USB stick! This is not safe: only your public key needs to be moved.

Then, on your Inria workstation, append the public key to your authorized keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

At this point, you should be able to execute the following command on your personal machine to connect to bastion.

ssh -i ~/.ssh/id_rsa <username>@bastion.inrialpes.fr

SSH configuration

You can use an SSH configuration file to make this easier. To do so, create or update the file $HOME/.ssh/config on your personal machine with the following lines:

Host bastion 
    HostName bastion.inrialpes.fr
    User <username>
    ProxyCommand none
    IdentityFile ~/.ssh/id_rsa 
Host *.inrialpes.fr
    User <username>
    ProxyCommand ssh -W %h:22 bastion
    ForwardX11 yes
    IdentityFile ~/.ssh/id_rsa
Host access*-cp
    ProxyCommand ssh <username>@bastion.inrialpes.fr "/usr/bin/nc %h %p"

Replace the occurrences of <username> with your Inria username. You can remove the ForwardX11 line if you do not need a graphical interface. Now you should be able to connect simply by typing ssh bastion.

After these steps, you can connect to one of the cluster front ends. The cluster is accessed from the front ends access1-cp.inrialpes.fr (Fedora) or access2-cp.inrialpes.fr (Ubuntu):

$ ssh access2-cp.inrialpes.fr
###################################################
##      BASTION SSH de l'INRIA Rhône-Alpes       ##
## en cas de probleme merci soumettre un ticket  ##
##         sur https://helpdesk.inria.fr         ##
###################################################
The authenticity of host 'access2-cp.inrialpes.fr (<no hostip for proxy command>)' can not be established.
ECDSA key fingerprint is SHA256:hJTGfwFZvC/8Aw1eQ6KuOjf3IeEbki6jxTsjTLzml48.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'access2-cp.inrialpes.fr' (ECDSA) to the list of known hosts.
Last login: Tue May 24 14:03:30 2022 from 194.199.18.129
your_login@access2-cp:~$ 

Data storage

There are two different storage spaces:

  • your home directory, which has only 10 GB of space;
  • a shared scratch accessible to the whole team.

You should not store data in your home directory: its space is limited and should be kept for your most important files.

The scratch is available at /services/scratch/TEAM_NAME/USERNAME

If your team folder does not exist, you can create a helpdesk ticket to have it created.

If your personal folder does not exist, you should be able to create it yourself; you can also ask your CMI to create it.

The scratch is the preferred space to store data linked to experiments.
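Assuming your team folder already exists, setting up your personal folder could look like this (TEAM_NAME and USERNAME are placeholders for your actual team and login):

```shell
# TEAM_NAME and USERNAME are placeholders for your team and login
mkdir -p /services/scratch/TEAM_NAME/USERNAME

# Check how much of your 10 GB home quota you are using
du -sh "$HOME"
```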

Execution environment

Unless you have specific system dependencies, the recommended way to run your jobs on the cluster is to use conda. (Otherwise you should use Singularity or Apptainer. If you need containerization and have never used it, ask for help from someone familiar with this kind of technology, because the learning curve can be really steep.)

To be able to use conda, you need to install it in /services/scratch/TEAM_NAME/USERNAME. Sharing a conda installation is probably doable, but you will likely have to adjust some permissions.

To install conda, follow this procedure:

  1. Go to your home directory or to your scratch directory and download the Miniconda installer:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  2. Then execute the following command:
chmod +x Miniconda3-latest-Linux-x86_64.sh && ./Miniconda3-latest-Linux-x86_64.sh
  3. When asked about the installation location, put it into your scratch space: /services/scratch/TEAM_NAME/USERNAME/miniconda3. Then accept to run conda init.
  4. Source your ~/.bashrc for the changes to take effect: source ~/.bashrc

Congratulations, you can now create Python environments for your projects!
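Once conda is installed, creating and using a project environment might look like this (the environment name, Python version, and package are just examples):

```shell
# Create an environment for a project (name and Python version are examples)
conda create -n myproject python=3.10 -y

# Activate it, install your dependencies, then run your code
conda activate myproject
pip install numpy
python my_script.py
```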

Run your first job

The scheduler used to handle the resources is OAR.

You can run a job with the following command:

oarsub -I

It will run an interactive job.

Different parameters allow you to specify the type of server you need. Two kinds of parameters exist: those set with -l (resources) and those set with -p (properties).

You can specify a host with:

oarsub -I -p "cluster='thoth' AND host='node3-thoth.inrialpes.fr'"

You have to specify the cluster when targeting a specific node. You can specify multiple nodes and clusters with a combination of OR and AND.

You can change the length of a job:

oarsub -l "walltime=48:0:0" "/path/to/my/script.sh" # launch a job for 48 hours

Ask for 2 GPUs on a single node:

oarsub -l "/host=1/gpudevice=2"

Ask for 2 GPUs on each of 2 nodes:

oarsub -l "/host=2/gpudevice=2"

You can also set, for example, a number of cores, a minimum amount of RAM, etc.

The parameter names correspond to the parameter names shown in Monika.

Almost every node has been funded by a specific team. Funding teams have priority on the nodes they funded and can therefore kill any besteffort job running on them.

If you are not in a team that has funded nodes, your jobs will by default be launched in the besteffort queue as idempotent jobs. Besteffort is the lowest-priority queue, and idempotent means that your job will be relaunched if it gets killed by a default-queue job.
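For non-interactive jobs, resource requests can also be embedded in the submitted script itself as #OAR comment directives. A minimal sketch, assuming a conda environment installed on the scratch as described above (TEAM_NAME, USERNAME, the environment name, and the script path are placeholders):

```shell
#!/bin/bash
#OAR -n my_experiment
#OAR -l /host=1/gpudevice=1,walltime=24:0:0
#OAR -t besteffort
#OAR -t idempotent

# Activate the conda environment from the scratch (paths are placeholders)
source /services/scratch/TEAM_NAME/USERNAME/miniconda3/etc/profile.d/conda.sh
conda activate myproject

# Run the experiment
python /services/scratch/TEAM_NAME/USERNAME/train.py
```

Submit it with oarsub -S ./my_script.sh (the script must be executable).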

You can connect to your running job with:

oarsub -C **JOB_ID**

You can delete your running job with:

oardel **JOB_ID**

You can get extra information concerning a job with

oarstat -f -j **JOB_ID**

You can show the currently running and waiting jobs of a user with

oarstat -u **USERNAME**

Example: run a script.sh on 4 nodes, using 16 cores on each node, on any cluster, with besteffort, for at most 30 minutes:

$ oarsub -l /nodes=4/core=16,walltime=00:30:00 -t besteffort -p "cluster='SIC' OR cluster='nanod' OR cluster='mistis'
OR cluster='kinovis' OR cluster='beagle' OR cluster='perception' OR cluster='thoth'"
/services/scratch/morpheo/kinovis/script.sh

Cluster schedule can be monitored on :

http://visu-cp.inrialpes.fr/monika
http://visu-cp.inrialpes.fr/drawgantt

You need to be connected to the VPN.

For more information

For a more elaborate use of the cluster, you need to know a bit more about the following tools.

OAR

OAR commands:

  • oarstat : shows currently running and waiting jobs, and who they belong to
  • oarsub -I [script or executable] : interactive mode; connects to a slave node and opens a shell for at most 2 hours
  • oarsub -t besteffort : runs as besteffort; the job can be killed by a higher-priority job
  • oarsub -l /nodes=x/cpu=x/core=x,walltime=xx:xx:xx : number of nodes/CPUs/cores and duration requested for the job
  • oarsub -p "cluster='SIC'[ OR cluster='name']" : choose the cluster(s) to run on; can be SIC, kinovis, nanod, beagle, mistis, perception, thoth (besteffort on other teams' clusters)
  • oarsub -C JOB_ID : connects to the job
  • oardel JOB_ID : kills the job

The standard output and error are redirected to OAR.JOB_ID.stdout and OAR.JOB_ID.stderr.
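For instance, with a job id of 123456 (an example id), you could inspect these files like this:

```shell
# 123456 is an example job id; OAR writes these files in the submission directory
cat OAR.123456.stdout      # standard output of the job
tail -f OAR.123456.stderr  # follow errors while the job is running
```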


Singularity or apptainer

Docs

Converting a Docker image to singularity/apptainer

To perform the conversion, Singularity 3.0 minimum or Apptainer v1 is required. There are several ways to convert (running a Docker image directly, or running a Docker container directly, but these can produce errors): the best is to build a dedicated Singularity image from the Docker tar archive.

WARNING: you must be the root user to build from a Singularity recipe file, so singularity must be added to the sudoers, with SETENV to preserve the Singularity environment variables:

$ vi /etc/sudoers.d/singularity
loginx ALL=(root) NOPASSWD:SETENV: /usr/local/bin/singularity
loginy ALL=(root) NOPASSWD:SETENV: /usr/local/bin/singularity
$ sudo -E singularity build ./my_image_singularity.img docker-archive:./my_image_docker_tar.tar
-> /root is still used for the cache, despite invoking sudo with -E, which should tell it to use our environment variables!
$ sudo SINGULARITY_CACHEDIR=/scratch/loginx/singularity/ SINGULARITY_TMPDIR=/scratch/loginx/singularity/ singularity build ./my_image_singularity.img docker-archive:./my_image_docker_tar.tar

This last command finally does the job and creates a Singularity image from the Docker one.
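Once the image has been copied to the cluster, it can typically be run inside an OAR job along these lines (the image name and script path are placeholders; --nv exposes the host GPUs, and --bind makes the scratch visible inside the container):

```shell
# Run a command inside the converted image, binding the scratch so data is visible
singularity exec --nv --bind /services/scratch:/services/scratch \
    ./my_image_singularity.img \
    python /services/scratch/TEAM_NAME/USERNAME/train.py
```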
