Usage

{{Tabs}}
= THE DIGITALIS PLATFORM IS CLOSED / DISMANTLED =

As of summer 2018, the Digitalis platform is no longer in operation.
= Overview =

The '''Digitalis platform''' is an experimentation platform for '''research in distributed computing''' (parallel computing, systems, networking). Digitalis is a satellite platform of [http://www.grid5000.fr Grid'5000], hosted within the Grenoble site of Grid'5000.
The platform is managed by '''Pierre Neyron (LIG/CNRS)'''.
= The machines =
== Grid'5000 Grenoble clusters ==
The Grenoble [https://www.grid5000.fr Grid'5000] site is composed of 3 clusters (as of 2012-03): genepi, edel and adonis. More information can be found on the [https://www.grid5000.fr/mediawiki/index.php/Grenoble:Home Grid'5000 Grenoble site pages].
Those machines are handled by the Grid'5000 global (national) system and managed by the Grid'5000 engineering team. One must therefore refer to the [https://www.grid5000.fr/mediawiki/index.php/Category:Portal:User Grid'5000 documentation] to learn how to use them. The next paragraphs of this page are mostly not relevant to those clusters.
Grid'5000 resources can be accessed indifferently from any Grid'5000 site (i.e. Grenoble users are not restricted to Grenoble hardware).
One just needs a [https://www.grid5000.fr/mediawiki/index.php/Grid5000:Get_an_account Grid'5000 account] to access the resources of Grid'5000.
== CIMENT pole ID ==
As of 2014, the pole ID of [https://ciment.ujf-grenoble.fr/wiki-pub/index.php/Welcome_to_the_CIMENT_site! CIMENT] has no specific hardware in CIMENT (managed with the CIMENT stack). Grid'5000 Grenoble's site hardware is however used by CIMENT for some purposes, like training (GPUs), etc.
Also, the CIMENT storage (iRODS) is replicated on data storage in the Grid'5000 network.
Other CIMENT resources (e.g. the Froggy cluster, 3000 cores) can nevertheless be used. One must request a [https://ciment.ujf-grenoble.fr/wiki-pub/index.php/Get_an_account CIMENT access].

== Digitalis local machines ==
Digitalis includes machines which are not managed by the Grid'5000 team, but which benefit from many services provided by Grid'5000 (tight cooperation).
First of all, access to those machines uses the Grid'5000 account credentials (more details below).

=== [[Grimage|Grimage cluster]] ===
The cluster of the Grimage platform, and more.

=== [[Ppol|Ppol cluster]] ===
3 recycled x86 machines from the Pipol platform: hybrid configuration of SSD + HDD and 10Gbps Ethernet.

=== [[Kinovis|Kinovis cluster]] ===
Currently the cluster (acquisition servers, ...) of the [http://kinovis.inrialpes.fr Kinovis platform] supports the required functions for the acquisition platform only.
Two machines are however available: [[idcin]].
=== Research teams' machines ===
Those machines are co-funded by several teams from [http://www.liglab.fr LIG] or [http://www.inria.fr/centre/grenoble Inria Grenoble] (mostly Mescal & Moais) in order to provide experimental platforms for problems such as:
* new or complex processor architectures
* large and complex SMP configurations
* multi-GPU configurations
* etc.
The following machines are available:
* [[idfreeze]]
* [[idgraf]]
* [[idphix]]
* [[idbool]]
* [[idarm]]
* [[idkat]]
As a courtesy to other researchers, those machines can be accessed when available.
== Hardware summary table ==

{{Hardware}}
= Services =
Machines from any Grid'5000 site can communicate without administrative restriction (access control), and with a very high throughput (10GE backbone).
However, since Grid'5000 is a very powerful scientific instrument, the outside world must be protected from buggy experiments or uncontrolled behaviors. Please read the following pages for information about this:
* https://www.grid5000.fr/mediawiki/index.php/Security_model
* https://www.grid5000.fr/mediawiki/index.php/Security_policy
== Dedicated services ==
Dedicated services are provided for the management of our machines. Indeed, our local machines could not fit in the Grid'5000 model, due to their special characteristics and usage:
The Grimage cluster is special in that it used to operate the Grimage platform, with cameras and other equipment attached, making its hardware configuration different.
Other local machines are special in that they are unique resources, which makes their model of usage very different from that of a cluster of many identical machines, as found in the Grid'5000 clusters.
As a result, a dedicated resource management system (OAR) is provided to manage access to the machines, with some special mechanisms (different from those in Grid'5000). A dedicated instance of the deployment system (kadeploy) is also provided to handle users' customized operating systems that can be deployed on the machines. Even if different from the main Grid'5000 tools, much of the Grid'5000 documentation also applies to our dedicated services. This document actually only explains their specificities.
The OAR and Kadeploy frontend for Digitalis machines (i.e. not Grid'5000) is the machine named '''digitalis.grenoble.grid5000.fr'''.
== Mutualised services (services provided by Grid'5000) ==
== Terms of service ==
Grid'5000 services are handled nationally for the global platform (11 sites, France-wide). As a result, some aspects may seem more complex than they should from a local perspective. Please mind the fact that some services are not for our local convenience only. Furthermore, '''the local platform is to be seen as an extension of the main Grid'5000 platform, which is not supported by the Grid'5000 national staff''', even if we can freely benefit from some services they provide.
As a result, we are subject to the rules of the Grid'5000 platform:
* Security policies: restricted access to the network, output traffic filtering.
* Maintenance schedules: Thursday is the maintenance day, so do not be surprised if an interruption of services happens on that day!
* Rules of good behavior within the large Grid'5000 user community (please pay attention to the mailing lists).
If one is using the "official" Grid'5000 nodes, one must comply with the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter Grid'5000 charter] (as approved by every user when requesting a Grid'5000 account).

{{Template:Acknowledgment}}
== Data integrity ==
= Platform usage =
== Charter of good usage ==
The charter of usage for the machines of Digitalis (except for Grid'5000 ''official'' machines, which follow the [https://www.grid5000.fr/mediawiki/index.php/Grid5000:UserCharter Grid'5000 Charter]) is the following:

; Communities:
Users of the platform are split in 2 communities:
* the owners of the machines (e.g. local users, buyers)
* the others
The others are welcome to use the machines, but the owners keep priority and privileged rights (e.g. they can possibly ask to drop jobs from others).
In any case, everybody is encouraged to plan their experiments, and possibly to book resources in advance, while trying to ask for reasonable (fair) shares of the resources (walltime).

'''Also, time is split into two phases: daytime and night.'''
; During daytime
* jobs should use the shared access as much as possible
* if machines are obviously unused, one may consider running exclusive (or deploy) jobs, but please try to limit them to 2 hours max (possibly renewable, see the ''redeploy job type'' for instance)
* during high-pressure periods, like before deadlines, any usage by local users might preempt other usage
; During the night
* night is every day from 18:00 to 9:00, week-ends from Friday 18:00 to Monday 9:00, and holidays (like Christmas, but not school holidays)
* night is the time for long, exclusive jobs, for experiments requiring exclusive access to the resources (for performance reasons for instance)
* however, if one just needs a long job, it is of course always preferred to run in the shared access mode

For now, the charter policy is not enforced by any technical means, so everyone's kindness is appreciated.

Also, if one requires a special usage of the resources, ''out of the charter'', one is encouraged to inform all other users using the mailing list [mailto:digitalis@list.grid5000.fr digitalis@list.grid5000.fr].

Again, while trying to foster mutualisation as much as possible, owners of the machines keep higher priority and privileges.
== Access to Digitalis ==
=== Get a Grid'5000 account ===
As a prerequisite to accessing Digitalis, you need to be able to access Grid'5000's network.
For that purpose, you require a Grid'5000 account. If you do not have one yet, please see: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Get_an_account. Most likely, you will end up on the following [https://www.grid5000.fr/mediawiki/index.php/Special:G5KRequestAccountUMS form] (relevant for French academics).
Also, '''make sure your account belongs to the digitalis group'''.
If you do not know a Grid'5000 manager, set pneyron as your manager.
For the initial user report, please mention your intended usage of the '''Digitalis platform'''.
=== Access to Grid'5000 ===
Once you have a Grid'5000 account, you can connect to the Grid'5000 network using ssh:
  $ ssh access.grid5000.fr

'''In case of any issue at that point''', please refer to the [https://www.grid5000.fr documentation of Grid'5000].

From there you can access the frontend of the Grid'5000 Grenoble site by running:
  $ ssh grenoble
Or that of any other Grid'5000 site, e.g.:
  $ ssh nancy

'''BUT the research teams' machines of Digitalis are not managed by Grid'5000 Grenoble's frontend''', see the next paragraph.
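
To avoid typing the two ssh hops every time, a classic trick is an SSH ProxyCommand on your workstation (a sketch only; the <tt>g5k</tt> aliases and the login placeholder are illustrative, adapt them to your account):

```
# ~/.ssh/config on your own workstation (sketch, adapt login and aliases)
Host g5k-access
    HostName access.grid5000.fr
    User YOUR_G5K_LOGIN
Host *.g5k
    User YOUR_G5K_LOGIN
    ProxyCommand ssh g5k-access -W "$(basename %h .g5k):%p"
```

With such a configuration, <tt>ssh digitalis.grenoble.g5k</tt> goes through access.grid5000.fr transparently.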
Please see also the '''tips and tricks section''' below, which provides '''a lot of useful information to ease the access'''.

=== Access to the Digitalis local machines ===
The frontend machine to use Digitalis' resources is: '''digitalis.grenoble.grid5000.fr'''. From the Grid'5000 access machine you can just do:
  $ ssh digitalis.grenoble.grid5000.fr
Like for Grid'5000 machines (but with a slightly different charter), '''access to the teams' machines is controlled by a resource manager'''.

This means that users '''cannot just ssh''' to a machine and have processes indefinitely running on them (e.g. vi or emacs processes).

Any user '''must''' book the machine for a period of time (a job), during which access will be granted to him.

Once the period of time has ended, all rights are revoked, and all processes of the user are '''killed'''.

By default, users are not root on the machines. Some privileged commands may however be permitted (e.g. schedtool).
Default access to a machine is '''not exclusive''', which means that many users can have processes on the machine at the same time, unless a user requested an exclusive access.

Just like on Grid'5000, it is possible on some machines to '''kadeploy'''. Special use cases indeed require full access to the machine: the need to be root, to reboot the machine, or to install software or a different operating system, without breaking it for others.

As a result, you need to use the OAR commands to get access to the experimentation machines.
== Use cases ==
  Connect to OAR job 1122 via the node idgraf.grenoble.grid5000.fr
  pneyron@idgraf:~$ 

''(Mind looking at the dedicated page of each machine for its details, e.g. for [[idbool]], one must use the `-l machine=1' option to run a job on the whole machine.)''

You then get access to the machine for 1 hour by default (add <tt>-l walltime=4</tt> for 4 hours).
Note that if the machine is not available (e.g. an exclusive job is already running), you will have to wait until it is free (see [[#Resource usage visualization tools|the resource usage visualization tools]]).
If no machine is specified, you get access to one of the grimage nodes.
You can use the '''oarsh''' command to open other shells to the machine, as long as the job is still running.
Please read the man pages of the OAR commands for more details.
=== I want to gain exclusive access to a machine for N hours ===
To get access to a machine as its only user (e.g. in order to avoid noise from other users), use the exclusive job type:
  pneyron@digitalis:~$ oarsub -I -p "machine = 'idgraf'" '''-t exclusive''' -l walltime=N
  [ADMISSION RULE] Modify resource description with type constraints
  Import job key from file: .ssh/id_rsa
  Connect to OAR job 1122 via the node idgraf.grenoble.grid5000.fr
  pneyron@idgraf:~$ 
This way you get access to the machine for N hours, and nobody else can access the machine during your job.
Note that if the machine is not available, you will have to wait until it is free (see [[#Resource usage visualization tools|the resource usage visualization tools]]).
Also, some privileged commands can be run via sudo in exclusive jobs (see the machines' dedicated pages).

=== I want to open a new shell in an existing job ===
There are several ways to open a shell in an OAR job.

Assuming you created a job as follows:
 [pneyron@digitalis ~]$ oarsub "sleep 1h"
 Properties:
 [ADMISSION RULE] Modify resource description with type constraints
 Generate a job key...
 OAR_JOB_ID=6028

You can:
; Use oarsub -C <job id>
 [pneyron@digitalis ~]$ oarsub -C 6028
 Connect to OAR job 6028 via the node grimage-8.grenoble.grid5000.fr
 [OAR] OAR_JOB_ID=6028
 [OAR] Your nodes are:
       grimage-8.grenoble.grid5000.fr*8
 
 [pneyron@grimage-8 ~](6028-->58mn)$
NB: with this method, you do not need to know the nodes used by your job, only the job id. Also, the environment is the same as in the shell opened upon oarsub.

; Use oarsh with the OAR_JOB_ID=<job id> environment variable
 [pneyron@digitalis ~]$ OAR_JOB_ID=6028 oarsh grimage-8.grenoble.grid5000.fr
 Linux grimage-8.grenoble.grid5000.fr 2.6.32-grimage #1 SMP Fri Jan 6 14:10:41 UTC 2012 x86_64
 This is a Grid'5000 compute node.
 You must have a reservation with OAR before using this host.
 Last login: Fri Feb 21 16:54:42 2014 from mu2.grenoble.grid5000.fr
 [pneyron@grimage-8 ~]$
NB: later on, you can also use oarsh on the node to connect from node to node (useful in multi-node jobs).
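
In multi-node jobs, a script can discover its reserved nodes from OAR's nodefile; here is a minimal sketch of a script one could submit (the script name is illustrative; it assumes OAR's usual <tt>$OAR_NODEFILE</tt> variable, which lists one line per reserved core):

```shell
#!/bin/sh
# Sketch of an experiment script one could submit, e.g.:  oarsub ./run_exp.sh
# OAR lists the reserved hosts (one line per core) in $OAR_NODEFILE.
list_nodes() {
  sort -u "${1:-/dev/null}"     # keep each hostname once
}
echo "Running on node(s):"
list_nodes "$OAR_NODEFILE"
# ...then launch the real experiment, e.g. one process per node via oarsh...
```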
 +
 
 +
; Use oarsh with a job key
 +
For that, create a public/private key pair on digitalis '''with no passphrase''' ''(for the sack of the ease of use and because this key should be for Grid'5000 internal usage only)'':
 +
pneyron@digitalis:~$ ssh-keygen -t rsa
 +
Generating public/private rsa key pair.
 +
Enter file in which to save the key (/home/pneyron/.ssh/id_rsa):
 +
[...]
 +
''Again: Do not use your existing sensible SSH keys here, for instance located on your workstation and protected by a passphrase of course !''
 +
 
 +
Then export the OAR_JOB_KEY_FILE environement variable:
 +
[pneyron@digitalis ~]$ export OAR_JOB_KEY_FILE=~/.ssh/id_rsa
 +
''You can also add the export line to you .bashrc if meaningful to you (make sure your .bashrc is sourced upon login, or look at your .profile or .bash_profile...)''
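
The steps above can be condensed into a small idempotent snippet (a sketch only; the key path is illustrative):

```shell
#!/bin/sh
# Sketch: create a dedicated, passphrase-less key for OAR job access only
# (never reuse a sensitive personal key here), then point OAR at it.
setup_oar_key() {
  key=$1
  mkdir -p "$(dirname "$key")"
  [ -f "$key" ] || ssh-keygen -q -t rsa -N "" -f "$key"   # no passphrase
  export OAR_JOB_KEY_FILE=$key
}
setup_oar_key "$HOME/.ssh/oar_job_key"    # path is illustrative
```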

You will now see that the oarsub command uses this key for your jobs:
 [pneyron@digitalis ~]$ oarsub "sleep 1h"
 Properties:
 [ADMISSION RULE] Modify resource description with type constraints
 '''Import job key from file: /home/pneyron/.ssh/id_rsa'''
 OAR_JOB_ID=6029
 [pneyron@digitalis ~]$

And you can connect to the job afterward:
 [pneyron@digitalis ~]$ export OAR_JOB_KEY_FILE=~/.ssh/id_rsa # useless if the export is done in .bashrc
 [pneyron@digitalis ~]$ oarsh grimage-9.grenoble.grid5000.fr
 Linux grimage-9.grenoble.grid5000.fr 2.6.32-grimage #1 SMP Fri Jan 6 14:10:41 UTC 2012 x86_64
 This is a Grid'5000 compute node.
 You must have a reservation with OAR before using this host.
 Last login: Thu Feb 20 14:29:09 2014 from mu2.grenoble.grid5000.fr
 [pneyron@grimage-9 ~]$

NB:
* Beware that with this method, if you have two or more jobs using the same job key and running simultaneously on the same node, you will '''always connect to the first job''' (your shell will end with that job).
* Using a job key allows one to ssh to a job from outside the Grid'5000 network, see [[#I_want_to_ssh_directly_from_my_workstation_to_my_experimentation_machine]].

=== I want to run batch jobs, like on a regular HPC cluster ===
If you do not want your jobs to overlap, as they do with the shared and exclusive job types, you can use the '''batch''' job type.
This job type activates OAR's original behavior, where a job waits for the termination of the previous job before starting.

Example:
* First job:
 [pneyron@digitalis ~]$ oarsub -p "host like 'grimage-10.%'" '''-t batch''' 'sleep 2h'
 Properties: host like 'grimage-10.%'
 [ADMISSION RULE] Modify resource description with type constraints
 Import job key from file: /home/pneyron/.ssh/id_rsa
 OAR_JOB_ID=5795
* Second job (-I is used here for the purpose of the demonstration only):
 [pneyron@digitalis ~]$ oarsub -p "host like 'grimage-10.%'" '''-t batch''' 'sleep 2h' -I
 Properties: host like 'grimage-10.%'
 [ADMISSION RULE] Modify resource description with type constraints
 Import job key from file: /home/pneyron/.ssh/id_rsa
 OAR_JOB_ID=5796
 Interactive mode : waiting...
 [2014-01-31 22:10:47] Start prediction: 2014-01-31 23:11:43 (FIFO scheduling OK)

NB: batch jobs are exclusive (but not timesharing=*,user).
=== I want to execute privileged commands on my node ===
Within an '''exclusive job''', some privileged commands can be run via sudo. Those authorized privileged commands typically have an impact on other users, hence they require an exclusive access (job) to the machine.
See the page dedicated to each machine for information about the available commands ([[grimage#Privileged_commands|grimage]], [[idfreeze#Privileged_commands|idfreeze]], [[idgraf#Privileged_commands|idgraf]], [[idphix#Privileged_commands|idphix]]).
If the privileged command you need is not available (available commands run without any sudo password prompt), you can ask your administrator whether it is possible to enable it. However, not every command can be made safe, and a command considered harmful to the system will not be made available. Please mind deploying your own operating system on the machine to get full privileges.
=== I want to be able to reboot a node without loosing my reservation ===
=== I want to change the system (OS, software) on the machine ===
Use the deploy type. See the Grid'5000 documentation about kadeploy. The kadeploy installation on digitalis works the same way.
=== I want to book the machine for next night ===
== Q&A / Tips and tricks ==
=== What do the OAR states of nodes mean exactly? ===
As shown by the [https://intranet.grid5000.fr/oar/grenoble/digitalis/drawgantt-svg/ reservation diagram], or the ''chandler'' command, nodes can be:
* '''Alive''': either free or running a job; this is the normal state of a node.
* '''Absent''': the machine is usually rebooting after a deploy job. This is a transitory state: the node should be Alive again soon (a few minutes). If the machine stays in the Absent state longer, a problem probably occurred, in which case this can be considered an abnormal state.
* '''Suspected''': a problem occurred, in the node clean-up for instance. Sometimes the state is transitory, but the node can be considered in an abnormal state.
* '''Dead''': the node was retired from the reservation system by the administrator for some reason. This is not an abnormal state.
In case of an abnormal state, you can try to reset the node by yourself: see the [[Usage#A_node_is_marked_Absent_or_Suspected.2C_how_to_fix_it_.3F|node-reboot command below]]. If this does not fix the issue, you can contact the administrator.
=== Access seems to be broken, what can I do? ===
You normally access the Grid'5000 network by ssh'ing to '''access.grid5000.fr'''.

However, if that access machine is not reachable:
# Check for known issues on the Grid'5000 incident page: https://www.grid5000.fr/status/
# Check your Grid'5000 emails about a possible outage or maintenance (planned or exceptional)
# Try other access paths to the Grid'5000 network, cascading ssh as follows:
## from the Internet > access-north.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the Internet > access-south.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the intranet of Inria Grenoble or LIG > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the Internet: LIG bastion (e.g. atoum.imag.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the Internet: Inria Grenoble bastion (e.g. bastion.inrialpes.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr

;NB:
* You can hide those ssh cascades by playing with the ssh config file and ProxyCommand directives (see the other tips and man ssh_config).
* You might also benefit from better bandwidth or latency by using the local access machine (access.grenoble.grid5000.fr).
=== Access to the Grid'5000 network is OK, but I can't reach digitalis nor grenoble ===
# Check for known issues on the Grid'5000 incident page: https://www.grid5000.fr/status/
# Check your Grid'5000 emails about a possible outage or maintenance (planned or exceptional)
# Try to access Grenoble's site directly with one of the following paths of cascaded ssh:
## from Inria Grenoble or LIG > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the Internet: LIG bastion (e.g. atoum.imag.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
## from the Internet: Inria Grenoble bastion (e.g. bastion.inrialpes.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
(connections to access.grenoble.grid5000.fr are restricted to local academic networks)

'''NB:''' If you can reach Grid'5000 Grenoble's site (fgrenoble.grenoble.grid5000.fr) but not digitalis.grenoble.grid5000.fr, that probably means that digitalis is broken. Please use the [mailto:digitalis@lists.grid5000.fr digitalis mailing list] to report the problem.
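With a recent OpenSSH (7.3 or later), such a cascade can also be written once in your workstation's ~/.ssh/config using ProxyJump. This is only a sketch: the hostnames are the ones listed above, and the pneyron login is an example to replace with your own.

```
# ~/.ssh/config -- sketch of a cascaded access through a Grid'5000 access machine
Host digitalis
    HostName digitalis.grenoble.grid5000.fr
    User pneyron
    # Jump through one of the access machines mentioned above
    ProxyJump pneyron@access-north.grid5000.fr
```

After that, a plain "ssh digitalis" performs the two hops transparently.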
=== I want to access digitalis directly without having to go first to the access machine ===
Add to your ssh configuration on your workstation (~/.ssh/config):
  cat <<'EOF' >> .ssh/config
  Host *.g5k
  ProxyCommand ssh pneyron@access.grid5000.fr -W "$(basename %h .g5k):%p"
  User pneyron
  ForwardAgent no
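The trick in this ProxyCommand is the hostname rewriting: ssh expands %h to the name you typed (e.g. digitalis.grenoble.g5k), and basename strips the .g5k pseudo-suffix to recover the real Grid'5000 hostname. The rewriting can be checked locally (a sketch; the hostname is just an example):

```shell
#!/bin/sh
# What the ProxyCommand computes from %h: strip the ".g5k" pseudo-domain
requested="digitalis.grenoble.g5k"      # what you typed (ssh's %h)
real=$(basename "$requested" .g5k)      # the name reached inside Grid'5000
echo "$real"                            # prints: digitalis.grenoble
```

The same mechanism is used below with the .g5koar suffix for connecting into jobs.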
  pneyron@digitalis:~$
Or to copy files without needing a 2-hop operation:
 neyron@workstation:~$ scp file digitalis.grenoble.g5k:/tmp/
 file                                        100% 2783    2.7KB/s  00:00

Same with rsync:
 neyron@workstation:~$ rsync -av file digitalis.grenoble.g5k:/tmp/
 sending incremental file list
 file
 sent 77 bytes  received 18 bytes  63.33 bytes/sec
 total size is 15  speedup is 0.16

;NB: This can be used to connect to any machine within Grid'5000 from the outside, assuming you can already ssh from the inside ('''Watch out''': see below if you want to connect to a machine within a job; this needs oarsh then)

=== I want to ssh directly from my workstation to a job on an experimentation machine ===
(Note: This does not apply to the case of deploy jobs)
Make sure that the job you create uses a '''job key'''. See [[#I_want_to_open_a_new_shell_in_an_existing_job]].

You should have an ssh/job key in ~/.ssh/id_rsa.
Copy your keys to your workstation:
Add to your ssh configuration on your workstation (~/.ssh/config):
  neyron@workstation:~$ cat <<'EOF' >> .ssh/config
  Host *.g5koar
  ProxyCommand ssh pneyron@access.grid5000.fr -W "$(basename %h .g5koar):6667"
  User oar
  IdentityFile ~/.ssh/id_rsa_g5k
(replace pneyron by your Grid'5000 login)
(replace pneyron by your Grid'5000 login)
Then you should be able to ssh directly to a machine '''that you previously reserved in a OAR job''':
+
Assuming you exported the OAR_JOB_KEY_FILE before doing the oarsub
 +
[pneyron@digitalis ~]$ '''export OAR_JOB_KEY_FILE=~/.ssh/id_rsa''' # useless if export done in .bashrc
 +
[pneyron@digitalis ~]$ oarsub -p "machine='idgraf'" "sleep 1h"
 +
Properties: machine='idgraf'
 +
[ADMISSION RULE] Modify resource description with type constraints
 +
Import job key from file: /home/pneyron/.ssh/id_rsa
 +
OAR_JOB_ID=6031
 +
 
 +
Then you should be able to ssh directly to the machine from your workstation:
  neyron@workstation:~$ ssh idgraf.grenoble.g5koar
  neyron@workstation:~$ ssh idgraf.grenoble.g5koar
  Linux idgraf.grenoble.grid5000.fr 3.2.0-2-amd64 #1 SMP Sun Mar 4 22:48:17 UTC 2012 x86_64
  Linux idgraf.grenoble.grid5000.fr 3.2.0-2-amd64 #1 SMP Sun Mar 4 22:48:17 UTC 2012 x86_64
  [...]
  [...]
  pneyron@idgraf:~$
  pneyron@idgraf:~$
=== I want to push/pull data from/to the outside to/from a machine ===
There are several ways of pushing/pulling files from/to the outside.
; Using NFS:
Assuming your Grid'5000 user's NFS home directory is mounted on the destination machine, you can access files from there after copying them to it with one of the following commands:
* Using the global access machine:
 neyron@workstation$ rsync -av file pneyron@access.grid5000.fr:grenoble/
* Using Grenoble's local access machine (access restricted):
 neyron@workstation$ rsync -av file pneyron@access.grenoble.grid5000.fr:
Then the file is available in the home directory of all Grenoble machines:
 neyron@machine$ ls /home/pneyron/
 file
(replace pneyron by your Grid'5000 login)

; Using the SSH proxy command setup:
See above for the setup of the .g5k and .g5koar SSH ProxyCommand. You can then run commands like:
 neyron@workstation$ rsync -av file digitalis.grenoble.g5k:/tmp/
 neyron@workstation$ rsync -av file idgraf.grenoble.g5koar:/tmp/
=== I want my code to be pushed automatically to the machine ===
see
  man inotifywait
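The idea is to block until something changes in your source tree, then rsync it to the machine. inotifywait (from the inotify-tools package) is the efficient way to block; the snippet below is only a portable sketch of the detect-then-push logic, using a checksum of the tree listing instead of inotify, with demo paths (the real destination would be e.g. the digitalis.grenoble.g5k alias from the ssh tips above):

```shell
#!/bin/sh
# Sketch: detect that a source tree changed, so it can be pushed with rsync.
snapshot() {
  # One checksum summarizing the paths, sizes and mtimes of the whole tree
  find "$1" -type f -exec ls -l {} + 2>/dev/null | cksum
}

SRC=$(mktemp -d)               # demo tree; use your real source directory
before=$(snapshot "$SRC")
echo "change" > "$SRC/newfile" # simulate an edit
after=$(snapshot "$SRC")

if [ "$before" != "$after" ]; then
  echo "tree changed"          # here you would run, in a loop:
  # rsync -av --delete "$SRC/" digitalis.grenoble.g5k:src/myexp/
fi
rm -rf "$SRC"
```

With inotify-tools installed, the polling is replaced by a blocking `while inotifywait -r -e modify,create,delete,move "$SRC"; do rsync ...; done` loop.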
=== A node is marked '''Absent''' or '''Suspected''', how to fix it? ===
  pneyron@digitalis:~$
Rarely, nodes can also be marked as '''Suspected''' for an unknown reason (typically, the OOM killer woke up...). If a node stays '''Suspected''' for a long time, you can also try to reboot it, using the same command.
=== kaconsole3 is not working on idgraf ===
  type "&."
=== I'd like to access resources stored outside of Grid'5000 (Internet) ===
Ex: the OS of idgraf is currently pretty old. If requested, it could be upgraded, to provide CUDA 5 by default for instance, instead of CUDA 4 currently (as of 2013-06).
=== I just deployed the default OS of a machine, and I cannot ssh to the machine with my user login ===
Default environments of machines have restrictions regarding user logins: only root and the oar user can connect via ssh (required by the oarsh mechanism). If you deploy a default environment, then you must comment out the last line in /etc/security/access.conf:
 -:ALL EXCEPT root oar:ALL
In case of doubt, you can actually comment out all the lines in the file.

Then you should be able to ssh to the machine using any valid user credentials.

=== I lost my deploy job, can I get my system back? ===
If your deploy job ended, but the machine(s) you were using is (are) available (no other deploy job in between), you can create a new deploy job and just reboot the machine to your system instead of deploying it again, using kareboot3.

For instance, let's say you deployed the jessie-x64-nfs environment on grimage-9:

 kadeploy3 -m grimage-9.grenoble.grid5000.fr -e jessie-x64-nfs -u root -k

If you lose your job on grimage-9, but get a new one right after the node rebooted back to the default environment (you'll notice a small period in the "absent" state), you can just reboot the node to your system (on partition 3), without needing to deploy again:

 kareboot3 -m grimage-9.grenoble.grid5000.fr -r recorded_env -e jessie-x64-nfs -u root -p 3
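The /etc/security/access.conf edit described in the question about deployed default environments can also be scripted. The sketch below demonstrates the edit on a temporary copy, so it can be tried anywhere; on a real deployed node you would run the sed command as root against /etc/security/access.conf itself.

```shell
#!/bin/sh
# Comment out the line that restricts ssh logins to root and oar.
# Demonstrated on a temporary copy of the file.
conf=$(mktemp)
echo '-:ALL EXCEPT root oar:ALL' > "$conf"   # stand-in for /etc/security/access.conf

# Prefix the restrictive line with '#' to disable it
sed -i 's/^-:ALL EXCEPT root oar:ALL$/#&/' "$conf"

cat "$conf"   # prints: #-:ALL EXCEPT root oar:ALL
rm -f "$conf"
```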
=== Any other question? ===
Please visit the Grid'5000 website: http://www.grid5000.fr
* https://www.grid5000.fr/mediawiki/index.php/FAQ
* ...
Or see the technical contact section below.
== Resource usage visualization tools ==
=== Gantt diagram of usage ===
OAR [https://intranet.grid5000.fr/oar/grenoble/digitalis/drawgantt-svg/ Drawgantt] diagram gives a view of the past, current and future usage of the machines.
(to see only one of the machines, you can set the ''filter'' parameter to one of the values shown in the select box, e.g. [https://intranet.grid5000.fr/oar/grenoble/digitalis/drawgantt-svg/?filter=grimage%20only https://intranet.grid5000.fr/oar/grenoble/digitalis/drawgantt-svg/?filter=grimage only] )
=== Other OAR tools ===
* etc.
= Getting help / technical questions =
== Mailing lists ==
=== Dedicated list ===
The [mailto:digitalis@lists.grid5000.fr digitalis@lists.grid5000.fr] mailing list is dedicated to communication about the locally managed machines. You should be subscribed to that list as soon as you belong to the Digitalis group (Grid'5000 account). Please make sure this is the case in your affiliation in the [https://api.grid5000.fr/sid/users/_admin/index.html Grid'5000 user management system].
You'll get information about the platform via this list, but it is also a community list, to which you can send questions or answer the questions of other users.

'''The [mailto:digitalis@lists.grid5000.fr digitalis mailing list] is the preferred medium for any question regarding the platform.'''
=== Grid'5000 lists ===
Grid'5000 provides several mailing lists which any Grid'5000 user automatically receives (e.g. users@lists.grid5000.fr, platform@lists.grid5000.fr). Since the local machines benefit from several global services of Grid'5000, you should keep an eye on information sent to those mailing lists, to be aware of potential exceptional maintenance for instance.
Also be aware that '''Thursday is the maintenance day''' on Grid'5000. Regular maintenances are scheduled which may for instance impact the NFS service, and affect Grid'5000 machines as well as Digitalis local machines.
Please '''do not use the users@lists.grid5000.fr list for issues related to the Digitalis local machines''', since the Grid'5000 staff is not in charge of those machines.
== Grid'5000 Platform Events ==
Please also keep the Grid'5000 platform events page in your bookmarks. It lists the future events which will impact the platform. You can also subscribe to the RSS feed.

Current revision as of 18:18, 16 January 2019




The machines

Grid'5000 Grenoble clusters

Grenoble Grid'5000 site is composed of 3 clusters (as of 2012-03): genepi, edel and adonis. More information can be found on the Grid'5000 Grenoble site pages. Those machines are handled by the Grid'5000 global (national) system, and managed by the Grid'5000 engineering team. One must then refer to the Grid'5000 documentation to know how to use them. The next paragraphs of this page are mostly not relevant to those clusters.

Grid'5000 resources can be accessed indifferently in any Grid'5000 sites (i.e. Grenoble users are not restricted to Grenoble hardware).

One just needs a Grid'5000 account to access the resources of Grid'5000.

CIMENT pole ID

As of 2014, the pole ID of CIMENT has no specific hardware in CIMENT (managed with the CIMENT stack). Grid'5000 Grenoble's site hardware is however used by CIMENT for some purposes like training (GPUs), etc. Also, CIMENT storage (Irods) is replicated on data storage in the Grid'5000 network.

Other CIMENT resources (e.g. the Froggy cluster, 3000 cores) can nevertheless be used. One must request a CIMENT access.

Digitalis local machines

Digitalis includes machines which are not managed by the Grid'5000 team, but benefit from many services provided by Grid'5000 (tight cooperation). First of all, access to those machines uses the Grid'5000 account credentials (more details below).

Grimage cluster

Cluster of the Grimage platform, and more.

Ppol cluster

3 recycled x86 machines from the Pipol platform: hybrid configuration of SSD + HDD, and 10Gbps Ethernet.

Kinovis cluster

Currently the cluster (acquisition servers, ...) of the Kinovis platform supports the required functions for the acquisition platform only.

Two machines are however available: idcin-1 and idcin-2.

Research teams' machines

Those machines are co-funded by several teams from LIG or Inria Grenoble (mostly Mescal & Moais) in order to provide experimental platforms such as:

  • new or complex processor architectures
  • large and complex SMP configurations
  • multi-GPU configurations
  • etc

The following machines are available:

As a courtesy to other researchers, those machines can be accessed when available.

Hardware summary table

'''Platform: Digitalis''' -> access via digitalis.grenoble.grid5000.fr
{| class="wikitable"
! Machine !! CPU !! RAM !! GPU !! Network !! Other
|-
| grimage-1.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || 1x GTX-680 (1 GPU) || IB DDR || Keyboard/Mouse/Screen attached (4/3 screen, on the left, same as grimage-7)
|-
| grimage-2.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || || IB DDR + 1x 10GE (DualPort) || 2x Camera (firewire)
|-
| grimage-3.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || 1x GTX-680 (1 GPU) || IB DDR || Keyboard/Mouse/Screen attached (16/9 screen, on the right) + 2x cameras (firewire)
|-
| grimage-4.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || || IB DDR + 1x 10GE (DualPort) || 2x Camera (firewire)
|-
| grimage-5.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || || IB DDR + 2x 10GE (DualPort) || 2x Camera (firewire)
|-
| grimage-6.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || || IB DDR + 1x 10GE (DualPort) ||
|-
| grimage-7.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || 1x GTX-580 (1 GPU) || IB DDR + 2x 10GE (DualPort) || Keyboard/Mouse/Screen attached (4/3 screen, on the left, same as grimage-1)
|-
| grimage-8.grenoble.grid5000.fr || 2x Intel Xeon E5530 (16 cores) || 12GB DDR3 || || IB DDR + 1x 10GE (DualPort) ||
|-
| grimage-9.grenoble.grid5000.fr || 2x Intel Xeon E5620 (16 cores) || 24GB DDR3 || 1x Tesla K40c || IB DDR ||
|-
| grimage-10.grenoble.grid5000.fr || 2x Intel Xeon E5620 (16 cores) || 24GB DDR3 || 2x GTX-295 (4 GPUs) || IB DDR ||
|-
| ppol-1.grenoble.grid5000.fr || 1x Intel Xeon E5620 (4 cores) || 12GB DDR3 || || Ethernet 10Gbps || 1x HDD 500GB SATA + 1x SSD 50GB SATA
|-
| ppol-2.grenoble.grid5000.fr || 1x Intel Xeon E5620 (4 cores) || 12GB DDR3 || || Ethernet 10Gbps || 1x HDD 500GB SATA + 1x SSD 50GB SATA
|-
| ppol-3.grenoble.grid5000.fr || 1x Intel Xeon E5620 (4 cores) || 12GB DDR3 || || Ethernet 10Gbps || 1x HDD 500GB SATA + 1x SSD 50GB SATA
|-
| idkoiff.imag.fr || 8x AMD Opteron 875 (16 cores) || 32GB DDR2 || 1x GTX-280 (1 GPU) || ||
|-
| idgraf.grenoble.grid5000.fr || 2x Intel Xeon X5650 (12 cores) || 72GB DDR3 || 8x Tesla C2050 (8 GPUs) || ||
|-
| idfreeze.grenoble.grid5000.fr || 4x AMD Opteron 6174 (48 cores) || 256GB DDR3 || || ||
|-
| idphix.grenoble.grid5000.fr || 1x Intel Xeon E5-2650 (8 cores) || 64GB DDR3 || 1x Xeon Phi KNC 5110P (61 cores) + 1x Nvidia Tesla K40c || IB QDR ||
|-
| idbool.grenoble.grid5000.fr || 12x AMD Opteron 6376 (192 cores) || 192GB DDR3 || || NumaConnect ||
|-
| idkat.grenoble.grid5000.fr || 4x Intel Xeon E7-4830 v3 (48 cores) || 256GB DDR4 || || IB QDR, Ethernet 10Gbps || 2x SSD 300GB SATA
|-
| idcin-1.grenoble.grid5000.fr || 2x Intel Xeon E5-2697 v3 (28 cores) || 256GB DDR4 || 1x Nvidia Tesla K40c + 1x Nvidia GeForce Titan Black || IB QDR, Ethernet 10Gbps || 1x HDD 300GB SAS
|-
| idcin-2.grenoble.grid5000.fr || 2x Intel Xeon E5-2697 v3 (28 cores) || 256GB DDR4 || 3x Nvidia GeForce Titan X || IB QDR, Ethernet 10Gbps || 1x HDD 300GB SAS
|-
| idarm-1.grenoble.grid5000.fr || 1x ARM Cortex-A57 (2 cores) + 1x ARM Cortex-A53 (4 cores) || 8GB DDR3 || || || nfsroot + 1x SSD 50GB SATA
|-
| idarm-2.grenoble.grid5000.fr || 1x ARM Cortex-A57 (2 cores) + 1x ARM Cortex-A53 (4 cores) || 8GB DDR3 || || || nfsroot + 1x SSD 50GB SATA
|}

'''Platform: Grid'5000''' -> access via frontend.grenoble.grid5000.fr
{| class="wikitable"
! Machine !! CPU !! RAM !! GPU !! Network !! Other
|-
| genepi-[1-34].grenoble.grid5000.fr || 2x Intel Xeon E5420 (16 cores) || 8GB DDR2 || || IB DDR ||
|-
| edel-[1-72].grenoble.grid5000.fr || 2x Intel Xeon E5520 (16 cores) || 24GB DDR3 || || IB QDR ||
|-
| adonis-[1-10].grenoble.grid5000.fr || 2x Intel Xeon E5520 (16 cores) || 24GB DDR3 || 1/2x S1070 (2 GPUs) || IB QDR ||
|}

Services

The Grid'5000 Network

The Digitalis platform benefits from the Grid'5000 infrastructure and, first of all, uses the Grid'5000 network. This allows unified access to many resources France-wide, within a single network space. Machines from any Grid'5000 site can communicate without administrative restriction (access control), and with a very high throughput (10GE backbone).

However, since Grid'5000 is a very powerful scientific instrument, the outside world must be protected from buggy experiments or uncontrolled behaviors. Please read the following pages for information about this:

The consequence is that one cannot just use machines on Grid'5000 as one uses a workstation on a laboratory intranet.

Dedicated services

Dedicated services are provided for the management of our machines. Indeed, our local machines could not fit in the Grid'5000 model, due to their special characteristics and usage: the Grimage cluster is special in that it is used to operate the Grimage platform, with cameras and other equipment attached, making its hardware configuration different. The other local machines are special in that they are unique resources, which makes their model of usage very different from that of a cluster of many identical machines, as found with Grid'5000 clusters.

As a result, a dedicated resource management system (OAR) is provided to manage access to the machines, with some special mechanisms (different from the ones in Grid'5000). A dedicated instance of the deployment system (kadeploy) is also provided to handle users' customized operating systems that can be deployed on the machines. Even if different from the main Grid'5000 tools, much of the Grid'5000 documentation also applies to our dedicated services. This document actually only explains their specificities.

The OAR and Kadeploy frontend for the Digitalis machines (i.e. not Grid'5000) is the machine named digitalis.grenoble.grid5000.fr.

Mutualised services (services provided by Grid'5000)

Many services we use on our local machines are provided by the Grid'5000 infrastructure, from a national perspective. For instance, the following services are provided for Grid'5000 but also serve our local purposes (by courtesy):

  • access machines
  • accounts (LDAP)
  • network home directory storage (NFS)
  • web proxy
  • and more.

Please keep in mind that these services are not dedicated to our local needs.

Terms of service

Grid'5000 services are handled nationally for the global platform (11 sites, France-wide). As a result, some aspects may seem more complex than they should from a local perspective. Please keep in mind that some services are not for our local convenience only. Furthermore, the local platform is to be seen as an extension of the main Grid'5000 platform, which is not supported by the Grid'5000 national staff, even if we can freely benefit from some services they provide.

As a result, we are subject to the rules of the Grid'5000 platform:

  • Security policies: restricted access to the network, output traffic filtering.
  • Maintenance schedules: Thursday is the maintenance day, do not be surprised if an interruption of services happens on that day!
  • Rules of good behavior within the large Grid'5000 user community (please pay attention to the mailing lists)

If one is using the "official" Grid'5000 nodes, one must comply with the Grid'5000 charter (as approved by every user when requesting a Grid'5000 account).

Please do not forget to include the following acknowledgment in any publication resulting from experiments performed on the Digitalis platform:
Experiments presented in this paper were carried out using the Digitalis platform (http://digitalis.imag.fr) of the Grid'5000 testbed. Grid'5000 is supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr). Access to the experimental machine(s) used in this paper was gracefully granted by research teams from LIG (http://www.liglab.fr) and Inria (http://www.inria.fr).

Also, please read the dedicated page for any of the machines you use, to see if any additional acknowledgment is requested.

Data integrity

There is no guarantee against data loss on the Grid'5000 NFS (home directories), nor on the machines' local hard drives. No backup is performed, so in case of an incident, the Grid'5000 staff will not be able to provide you with any way to get back any data.

As a result, if you have data you really care about, and cannot reproduce it at an acceptable cost (computation time) with regard to the risk of data loss (which rarely happens), it is strongly suggested that you back it up elsewhere.

(The NFS storage uses RAID to overcome a disk failure, but RAID is not backup.)
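As an illustration, results can be bundled into a dated archive before being copied out of Grid'5000. This is only a sketch with demo paths; adapt src to your real data, and the commented scp line assumes the global access machine described earlier.

```shell
#!/bin/sh
# Bundle a results directory into a dated archive so it can be copied
# outside Grid'5000 for safekeeping (demo paths).
src=$(mktemp -d)
echo "some data" > "$src/results.txt"   # stand-in for real results

archive="${TMPDIR:-/tmp}/results-$(date +%Y%m%d).tar.gz"
tar -czf "$archive" -C "$src" .

# Verify the archive content before relying on it
tar -tzf "$archive" | grep -q 'results.txt' && echo "archive OK"

# From your workstation, you would then fetch it, e.g.:
#   scp pneyron@access.grid5000.fr:/tmp/results-*.tar.gz .
rm -rf "$src" "$archive"
```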

Platform usage

Charter of good usage

The charter of usage for the machines of Digitalis (except for Grid'5000 official machines which follow Grid'5000 Charter) is the following:

Communities

Users of the platform are split into 2 communities:

  • the owners of the machines (e.g. local users, buyers)
  • the others

The others are welcome to use the machines, but the owners keep priority and privileged rights (e.g. they can possibly ask to drop jobs from others). In any case, everybody is encouraged to plan their experiments, and possibly to book resources in advance, while trying to ask for reasonable (fair) shares of the resources (walltime).


Also, time is split into two phases: daytime and night.

During daytime
  • jobs should use the shared access as much as possible
  • if machines are obviously unused, one may consider running exclusive (or deploy) jobs, but please try to limit them to 2 hours max (possibly renewable, see the redeploy job type for instance).
  • during high pressure periods, like before deadlines, any usage by local users might preempt other usage.
During the night
  • night is every day from 18:00 to 9:00, week-ends from 18:00 on Friday to 9:00 on Monday, and holidays (like Christmas, but not school holidays)
  • night is the time for long, exclusive jobs, for experiments requiring exclusive access to the resources (for performance reasons for instance)
  • however, if one just needs a long job, it is of course always preferred to run in the shared access mode

For now, the charter policy is not enforced by any technical means, so everyone's kindness is appreciated.

Also, if one requires a special usage of the resources, outside the charter, one is encouraged to inform all other users via the mailing list digitalis@lists.grid5000.fr.

Again, while trying to foster mutualisation as much as possible, the owners of the machines keep higher priority and privileges.

Access to Digitalis

Get a Grid'5000 account

As a prerequisite to access Digitalis, you need to be able to access Grid'5000's network.

For that purpose, you require a Grid'5000 account. If you do not have one yet, please see: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Get_an_account. Most likely, you should end up on the following form (relevant for French academics).

Also, make sure your account belongs to the digitalis group.

If you do not know a Grid'5000 manager, set pneyron as your manager.

For the initial user report, please mention your intended usage of the Digitalis platform.

Access to Grid5000

Once you have a Grid'5000 account, you can reach the Grid'5000 network using ssh:

$ ssh access.grid5000.fr

In case of any issue at that point, please report to the documentation of Grid'5000.

From there you can access the frontend of the Grid'5000 Grenoble site, by running

$ ssh grenoble

Or to any other Grid'5000 site, e.g.

$ ssh nancy

BUT the research teams' machines of Digitalis are not managed by Grid'5000 Grenoble's frontend, see the next paragraph.

Please also see the tips and tricks section below, which provides a lot of useful information to ease access.

Access to the Digitalis local machines

The frontend machine for using Digitalis' resources is digitalis.grenoble.grid5000.fr. From the Grid'5000 access machine, you can just do:

$ ssh digitalis.grenoble.grid5000.fr

Like for Grid'5000 machines (but with a slightly different charter), access to the teams' machines is controlled by a resource manager.

This means that users cannot just ssh to a machine and leave processes running on it indefinitely (e.g. vi or emacs processes).

Any user must book the machine for a period of time (a job), during which access is granted.

Once the period of time is ended, all rights are revoked, and all processes of the user are killed.

By default users are not root on the machines. Some privileged commands may however be permitted (e.g. schedtool). Default access to a machine is not exclusive, which means that many users can have processes on the machine at a same time, unless a user requested an exclusive access.

Just like on Grid'5000, it is possible on some machines to kadeploy. Special use cases indeed require full access to the machine: being root, rebooting the machine, or installing software or a different operating system, without breaking it for others.

As a result, you need to use the OAR commands to get access to the experimentation machines.

Use cases

I want to access a machine

To access a specific machine, just provide the machine name in the oarsub command:

pneyron@digitalis:~$ oarsub -I -p "machine = 'idgraf'"
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: .ssh/id_rsa
OAR_JOB_ID=1122
Interactive mode : waiting...
Starting... 

Connect to OAR job 1122 via the node idgraf.grenoble.grid5000.fr
pneyron@idgraf:~$ 

(Mind looking at the dedicated page of each machine for its details, e.g. for idbool, one must use the `-l machine=1' option to run a job on the whole machine).

You then get access to the machine for 1 hour by default (add -l walltime=4 for 4 hours).
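For instance, combining the machine selection and walltime options gives (a sketch following the syntax above; the machine name and duration are examples):

```shell
# Interactive job on idgraf, booked for 4 hours instead of the default 1 hour
oarsub -I -p "machine = 'idgraf'" -l walltime=4
```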

Note that if the machine is not available (e.g. an exclusive job is already running), you will have to wait until it is free (see the resource usage visualization tools).

If no machine is specified, you get access to one of the grimage nodes.

You can use the oarsh command to open other shells to the machine, as long as the job is still running.

Please read the man pages of the OAR commands for more details.

I want to gain exclusive access to a machine for N hours

To get access to a machine as its only user (e.g. in order to avoid noise from other users), use the exclusive job type:

pneyron@digitalis:~$ oarsub -I -p "machine = 'idgraf'" -t exclusive -l walltime=N
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: .ssh/id_rsa
OAR_JOB_ID=1122
Interactive mode : waiting...
Starting... 

Connect to OAR job 1122 via the node idgraf.grenoble.grid5000.fr
pneyron@idgraf:~$ 

This way you get access to the machine for N hours, and nobody else can access the machine during your job.

Note that if the machine is not available, you will have to wait until it is free (see the resource usage visualization tools).

Also, some privileged commands can be run via sudo in exclusive jobs (see the machines' dedicated pages).

I want to open a new shell in an existing job

There are several ways to open a shell in an OAR job.

Assuming you created a job as follows:

[pneyron@digitalis ~]$ oarsub "sleep 1h"
Properties: 
[ADMISSION RULE] Modify resource description with type constraints
Generate a job key...
OAR_JOB_ID=6028

You can:

Use oarsub -C <job id>
[pneyron@digitalis ~]$ oarsub -C 6028
Connect to OAR job 6028 via the node grimage-8.grenoble.grid5000.fr
[OAR] OAR_JOB_ID=6028
[OAR] Your nodes are:
      grimage-8.grenoble.grid5000.fr*8

[pneyron@grimage-8 ~](6028-->58mn)$ 

NB: With this method, you do not need to know the nodes used by your job, only the job id. Also, the environment is the same as in the shell opened upon oarsub.

Use oarsh with the OAR_JOB_ID=<job id> environment variable
[pneyron@digitalis ~]$ OAR_JOB_ID=6028 oarsh grimage-8.grenoble.grid5000.fr
Linux grimage-8.grenoble.grid5000.fr 2.6.32-grimage #1 SMP Fri Jan 6 14:10:41 UTC 2012 x86_64
This is a Grid'5000 compute node.
You must have a reservation with OAR before using this host.
Last login: Fri Feb 21 16:54:42 2014 from mu2.grenoble.grid5000.fr
[pneyron@grimage-8 ~]$ 

NB: later on, you can also use oarsh on the node to connect from node to node (useful in multi-node jobs)

Use oarsh with a job key

For that, create a public/private key pair on digitalis with no passphrase (for the sake of ease of use, and because this key should be for Grid'5000 internal usage only):

pneyron@digitalis:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pneyron/.ssh/id_rsa):
[...]

Again: do not use your existing sensitive SSH keys here, for instance those located on your workstation and protected by a passphrase!

Then export the OAR_JOB_KEY_FILE environment variable:

[pneyron@digitalis ~]$ export OAR_JOB_KEY_FILE=~/.ssh/id_rsa

You can also add the export line to your .bashrc if meaningful to you (make sure your .bashrc is sourced upon login, or look at your .profile or .bash_profile...).

You will now see that the oarsub command will use this key for your jobs.

[pneyron@digitalis ~]$ oarsub  "sleep 1h"
Properties: 
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: /home/pneyron/.ssh/id_rsa
OAR_JOB_ID=6029
[pneyron@digitalis ~]$

And you can connect to the job afterward:

[pneyron@digitalis ~]$ export OAR_JOB_KEY_FILE=~/.ssh/id_rsa # useless if export done in .bashrc 
[pneyron@digitalis ~]$ oarsh grimage-9.grenoble.grid5000.fr
Linux grimage-9.grenoble.grid5000.fr 2.6.32-grimage #1 SMP Fri Jan 6 14:10:41 UTC 2012 x86_64
This is a Grid'5000 compute node.
You must have a reservation with OAR before using this host.
Last login: Thu Feb 20 14:29:09 2014 from mu2.grenoble.grid5000.fr
[pneyron@grimage-9 ~]$


I want to run batch jobs, like on a regular HPC cluster

If you do not want your jobs to overlap, as they can with the shared and exclusive job types, you can use the batch job type. This job type activates OAR's original behavior, where a job waits for the termination of the previous job before starting.

Example:

  • First job:
[pneyron@digitalis ~]$ oarsub  -p "host like 'grimage-10.%'" -t batch 'sleep 2h'
Properties: host like 'grimage-10.%'
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: /home/pneyron/.ssh/id_rsa
OAR_JOB_ID=5795
  • Second job (-I is used here for the purpose of the demonstration only)
[pneyron@digitalis ~]$ oarsub  -p "host like 'grimage-10.%'" -t batch 'sleep 2h' -I
Properties: host like 'grimage-10.%'
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: /home/pneyron/.ssh/id_rsa
OAR_JOB_ID=5796
Interactive mode : waiting...
[2014-01-31 22:10:47] Start prediction: 2014-01-31 23:11:43 (FIFO scheduling OK)

NB: batch jobs are exclusive (but not timesharing=*,user)

I want to execute privileged commands on my node

Within an exclusive job, some privileged commands can be run via sudo. Those authorized privileged commands typically have an impact on other users, hence they require an exclusive access (job) to the machine.

See the page dedicated to each machine for information about the available commands (grimage, idfreeze, idgraf, idphix).

If the privileged command you need is not available (available commands run without any sudo password prompt), you can ask your administrator whether it is possible to enable it. However, not all commands are safe, and if one is considered harmful to the system, it will not be made available. In that case, consider deploying your own operating system on the machine to get full privileges.
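As an illustration, if schedtool (mentioned earlier on this page) is among the commands permitted via sudo on your machine, it could be used as in the following sketch; the availability of the command and the ./my_benchmark program are assumptions, so check the machine's dedicated page first:

```shell
# Run a (hypothetical) benchmark under the SCHED_FIFO real-time policy
# at static priority 50; only works where sudo schedtool is whitelisted
sudo schedtool -F -p 50 -e ./my_benchmark
```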

I want to be able to reboot a node without losing my reservation

Rebooting a node kills jobs; therefore a special job type is provided to overcome this and allow rebooting nodes while keeping them booked. Unsurprisingly, this job type is named reboot (-t reboot). This type of job does not provide a shell on a node but on the frontend instead (just like deploy jobs). To get access to the nodes, the user must then run an exclusive job concurrently, and possibly several of them if they get interrupted by reboots.

Example of use:

pneyron@digitalis:~$ oarsub -I -t reboot -p "host like 'grimage-4.%'"
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=1129
Interactive mode : waiting...
Starting...

Connect to OAR job 1129 via the node 127.0.0.1
pneyron@digitalis:~$ 

Note that you get a shell on digitalis instead of on grimage-4, unlike with an exclusive job.


While such a job is running, reboot can be performed either from the node (from the shell of an exclusive job) or from the frontend (digitalis).

Reboot from the node, as follows
pneyron@digitalis:~$ oarsub -I -t exclusive -p "host like 'grimage-4.%'"
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=1130
Interactive mode : waiting...
Starting...

Connect to OAR job 1130 via the node grimage-4.grenoble.grid5000.fr
pneyron@grimage-4:~$ 
pneyron@grimage-4:~$ sudo reboot
The system is going down for reboot NOW!
pneyron@grimage-4:~$ Connection to grimage-4.grenoble.grid5000.fr closed by remote host.
Connection to grimage-4.grenoble.grid5000.fr closed.
[ERROR] An unknown error occured : 65280
Disconnected from OAR job 1130
pneyron@digitalis:~$

(The interruption of the job due to the reboot causes some errors, which can of course be ignored.)

Reboot from the frontend as follows
pneyron@digitalis:~$ sudo node-reboot grimage-4.grenoble.grid5000.fr
[sudo] password for pneyron: 
*** Checking if pneyron is allowed to reboot grimage-4.grenoble.grid5000.fr
OK, you have a job of type "reboot" on the node, firing a reboot command !
--- switch_pxe (grimage cluster)
  >>>  grimage-4.grenoble.grid5000.fr
--- reboot (grimage cluster)
  >>>  grimage-4.grenoble.grid5000.fr
  *** A soft reboot will be performed on the nodes grimage-4.grenoble.grid5000.fr
-------------------------
CMD: ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -o ConnectTimeout=2 -o UserKnownHostsFile=/dev/null -i /etc/kadeploy3/keys/id_deploy root@grimage-4.grenoble.grid5000.fr "nohup /sbin/reboot -f &>/dev/null &"
grimage-4.grenoble.grid5000.fr -- EXIT STATUS: 0
-------------------------
--- set_vlan (grimage cluster)
  >>>  grimage-4.grenoble.grid5000.fr
  *** Bypass the VLAN setting

NB: Please note that reboot jobs are exclusive.

Once the node has rebooted, the user can get a new shell on it by resubmitting an exclusive job, thanks to the reboot job which guarantees that no other user reserved the node in the meantime.

I want to change the system (OS, software) on the machine

Use the deploy type. See Grid'5000 documentation about kadeploy. The kadeploy installation on digitalis works the same way.
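A minimal deploy session could look like the following sketch, reusing the kadeploy3 invocation shown further down this page (host and environment names are examples):

```shell
# 1) Book the node with a deploy job: the shell opens on the frontend
oarsub -I -t deploy -p "host like 'grimage-9.%'" -l walltime=2
# 2) From that shell, deploy an environment; -k copies your SSH key for root
kadeploy3 -m grimage-9.grenoble.grid5000.fr -e jessie-x64-nfs -u root -k
```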

I want to book the machine for next night

OAR allows advance reservations:

pneyron@digitalis:~$ oarsub -r "2012-04-01 20:00:00" -l walltime=4 -p "machine='idgraf'"
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: .ssh/id_rsa
OAR_JOB_ID=1125
Reservation mode : waiting validation...
Reservation valid --> OK
pneyron@digitalis:~$ 

Once your job starts (on April 1st, 8pm), you will be able to oarsh to the node.

See OAR's documentation for more information.
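If your plans change, an advance reservation can be cancelled with the standard OAR oardel command, passing the job id printed by oarsub:

```shell
# Cancel reservation 1125 (the OAR_JOB_ID returned above)
oardel 1125
```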

Q&A / Tips and tricks

What do the OAR statuses of nodes mean exactly?

As shown by the reservation diagram, or the chandler command, nodes can be:

  • Alive: either free or running a job, this is the normal state of a node
  • Absent: the machine is usually rebooting after a deploy job. This is a transitory state: the node should be Alive again soon (within a few minutes). If the machine stays in the Absent state longer, a problem probably occurred; in that case, this can be considered an abnormal state.
  • Suspected: a problem occurred, in the node clean-up for instance. Sometimes the state is transitory, but the node can be considered in an abnormal state.
  • Dead: the node was retired from the reservation system by the administrator for some reason. This is not an abnormal state.

In case of an abnormal state, you can try to reset the node by yourself: see the node-reboot command below. If this does not fix the issue, you can contact the administrator.

Access seems to be broken, what can I do ?

You normally access the Grid'5000 network by ssh'ing to access.grid5000.fr.

However, if that access machine is not reachable:

  1. Check for known issues on Grid'5000 incident page: https://www.grid5000.fr/status/
  2. Check your grid'5000 emails about possible outage or maintenance (planned or exceptional)
  3. Try other access paths to the Grid'5000 network, cascading ssh as follows:
    1. from Internet > access-north.grid5000.fr > digitalis.grenoble.grid5000.fr
    2. from Internet > access-south.grid5000.fr > digitalis.grenoble.grid5000.fr
    3. from the intranet of Inria Grenoble or LIG > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
    4. from Internet: LIG bastion (e.g. atoum.imag.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
    5. from Internet: Inria Grenoble bastion (e.g. bastion.inrialpes.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
NB
  • You can hide those ssh cascades by playing with the ssh config file and proxycommands (see other tips and man ssh_config).
  • You might also want to benefit from a better bandwidth or latency by using the local access (access.grenoble.grid5000.fr).
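With a reasonably recent OpenSSH client (7.3 or later), such a cascade can be written as a single command using the -J (ProxyJump) option; replace pneyron with your Grid'5000 login:

```shell
# Cascade through the north access point down to digitalis in one command
ssh -J pneyron@access-north.grid5000.fr pneyron@digitalis.grenoble.grid5000.fr
```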

Access to the Grid'5000 network is ok, but I can't reach digitalis nor grenoble

  1. Check for known issues on Grid'5000 incident page: https://www.grid5000.fr/status/
  2. Check your grid'5000 emails about possible outage or maintenance (planned or exceptional)
  3. Try to access Grenoble's site directly with one of the following paths of cascaded ssh:
    1. from Inria Grenoble or LIG > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
    2. from Internet: LIG bastion (e.g. atoum.imag.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr
    3. from Internet: Inria Grenoble bastion (e.g. bastion.inrialpes.fr) > access.grenoble.grid5000.fr > digitalis.grenoble.grid5000.fr

(connections to access.grenoble.grid5000.fr are restricted to local academic networks)

NB: If you can reach Grid'5000 Grenoble's site (fgrenoble.grenoble.grid5000.fr) but not digitalis.grenoble.grid5000.fr, that probably means that digitalis is broken. Please use the digitalis mailing list to report the problem.

I want to access digitalis directly without having to go through the access machine first

Add to your ssh configuration on your workstation (~/.ssh/config):

cat <<'EOF' >> .ssh/config
Host *.g5k
ProxyCommand ssh pneyron@access.grid5000.fr -W "$(basename %h .g5k):%p"
User pneyron
ForwardAgent no
EOF

(replace pneyron by your Grid'5000 login)

Make sure you pushed your SSH public key to Grid'5000: see https://api.grid5000.fr/sid/users/_admin/index.html

Then you should be able to ssh to digitalis directly:

neyron@workstation:~$ ssh digitalis.grenoble.g5k
Linux digitalis.grenoble.grid5000.fr 2.6.26-2-xen-amd64 #1 SMP Tue Jan 25 06:13:50 UTC 2011 x86_64
[...]
Last login: Thu Mar 22 14:36:05 2012 from access.grenoble.grid5000.fr
pneyron@digitalis:~$

Or copy files without needing a 2-hop operation:

neyron@workstation:~$ scp file digitalis.grenoble.g5k:/tmp/
file                                         100% 2783     2.7KB/s   00:00 

Same with rsync:

neyron@workstation:~$ rsync -av file digitalis.grenoble.g5k:/tmp/
sending incremental file list
file
sent 77 bytes  received 18 bytes  63.33 bytes/sec
total size is 15  speedup is 0.16
NB: This can be used to connect to any machine within Grid'5000 from the outside, assuming you can already ssh to it from the inside. (Watch out: see below if you want to connect to a machine within a job, as this needs oarsh.)

I want to ssh directly from my workstation to a job on an experimentation machine

(Note: This does not apply to the case of deploy jobs)

Make sure that the job you create uses a job key. See #I_want_to_open_a_new_shell_in_an_existing_job

You should have a ssh/job key in ~/.ssh/id_rsa.

Copy your keys to your workstation:

scp digitalis.grenoble.g5k:.ssh/id_rsa ~/.ssh/id_rsa_g5k
scp digitalis.grenoble.g5k:.ssh/id_rsa.pub ~/.ssh/id_rsa_g5k.pub

Add to your ssh configuration on your workstation (~/.ssh/config):

neyron@workstation:~$ cat <<'EOF' >> .ssh/config
Host *.g5koar
ProxyCommand ssh pneyron@access.grid5000.fr -W "$(basename %h .g5koar):6667"
User oar
IdentityFile ~/.ssh/id_rsa_g5k
ForwardAgent no
EOF

(replace pneyron by your Grid'5000 login)

Assuming you exported the OAR_JOB_KEY_FILE variable before doing the oarsub:

[pneyron@digitalis ~]$ export OAR_JOB_KEY_FILE=~/.ssh/id_rsa # useless if export done in .bashrc 
[pneyron@digitalis ~]$ oarsub -p "machine='idgraf'" "sleep 1h"
Properties: machine='idgraf'
[ADMISSION RULE] Modify resource description with type constraints
Import job key from file: /home/pneyron/.ssh/id_rsa
OAR_JOB_ID=6031

Then you should be able to ssh directly to the machine from your workstation:

neyron@workstation:~$ ssh idgraf.grenoble.g5koar
Linux idgraf.grenoble.grid5000.fr 3.2.0-2-amd64 #1 SMP Sun Mar 4 22:48:17 UTC 2012 x86_64
[...]
pneyron@idgraf:~$

I want to push/pull data from/to the outside to/from a machine

There are several ways of pushing/pulling files from/to the outside.

Using NFS

Assuming your Grid'5000 user's NFS home directory is mounted on the destination machine, you can access files from there after copying them to it with one of the following commands:

  • Using the global access machine:
neyron@workstation$ rsync -av file pneyron@access.grid5000.fr:grenoble/
  • using Grenoble local access machine (access restricted):
neyron@workstation$ rsync -av file pneyron@access.grenoble.grid5000.fr:

Then the file is available in the home directory on all Grenoble machines:

neyron@machine$ ls /home/pneyron/
file

(replace pneyron by your Grid'5000 login)

Using the SSH proxy command setup

See above the setup of the .g5k and .g5koar SSH proxycommand. You can then run commands like

neyron@workstation$ rsync -av file digitalis.grenoble.g5k:/tmp/
neyron@workstation$ rsync -av file idgraf.grenoble.g5koar:/tmp/

I want my code to be pushed automatically to the machine

One can use inotifywait for instance.

To push files edited by vi for instance:

while f=$(inotifywait . --excludei '(\.swp)|(~)$' -e modify --format %f); do rsync -av "$f" remote_machine:remote_dir/; done

see

man inotifywait

A node is marked Absent or Suspected, how can I fix it?

Nodes sometimes stay Absent after deploy jobs. While a short Absent period is normal during the reboot phase that follows the termination of a deploy job, a long Absent period (more than 15 minutes) usually reveals a failed reboot. If you detect such a problem, please feel free to reboot the node again from the frontend, as follows:

pneyron@digitalis:~$ sudo node-reboot grimage-9.grenoble.grid5000.fr
[sudo] password for pneyron: 
*** Checking if pneyron is allowed to reboot grimage-9.grenoble.grid5000.fr
OK, node is absent or suspected, firing a reboot command !
--- switch_pxe (grimage cluster)
  >>>  grimage-9.grenoble.grid5000.fr
--- reboot (grimage cluster)
  >>>  grimage-9.grenoble.grid5000.fr
  *** A soft reboot will be performed on the nodes grimage-9.grenoble.grid5000.fr
-------------------------
CMD: ssh -q -o BatchMode=yes -o StrictHostKeyChecking=no -o PreferredAuthentications=publickey -o ConnectTimeout=2 -o UserKnownHostsFile=/dev/null -i /etc/kadeploy3/keys/id_deploy root@grimage-9.grenoble.grid5000.fr "nohup /sbin/reboot -f &>/dev/null &"
grimage-9.grenoble.grid5000.fr -- EXIT STATUS: 0
-------------------------
--- set_vlan (grimage cluster)
  >>>  grimage-9.grenoble.grid5000.fr
  *** Bypass the VLAN setting
pneyron@digitalis:~$

Rarely, nodes can also be marked Suspected for an unknown reason (typically, the OOM killer woke up...). If a node stays Suspected for a long time, you can also try to reboot it, using the same command.

kaconsole3 is not working on idgraf

The IPMI stack of idgraf's BMC is buggy. If you want to use the console but find it broken (no prompt), you can try to reset the BMC.

This is possible if you are in an exclusive job, by running:

sudo ipmi-reset

This is also possible if you are root (i.e. in a deploy job), by running

ipmitool mc reset cold

(Please do not play with other IPMI commands, since this will break the system).

NB: this reset takes a few minutes to complete.

How do I exit kaconsole3?

Type "&."


I'd like to access resources stored outside of Grid'5000 (Internet)

Depending on what one wants to do, several options exist:

For the specific case of the access to source control repositories (SVN, Git,...), at least 2 options are possible:

  • Configure the HTTP proxy settings in your SVN or Git configuration, and use the webdav access method (i.e. not ssh). This requires that your repository server be white-listed (which is the case for common servers).
  • Checkout sources on your workstation and synchronize them to the experimentation machine (using any combination of the following tools: rsync, ssh with proxy command, inotify,...)

NB: Soon, a NAT service should be provided, allowing any kind of IP connection from inside Grid'5000 to white-listed Internet destinations. This should make life easier.

The default OS on the machine I use does not provide what I need

If you think deploying is overkill for your need, because you just need one more package, or a version upgrade that seems straightforward, you are entitled to ask for it to be applied to the default OS of the machine.

The good way to go is the following:

  1. test by yourself that you are indeed right: use kadeploy to install a copy of the default OS of the machine in a job, hence getting full super-user privileges
  2. do what you think is good
  3. take note of every modification you did
  4. finally, ask your administrator (me) whether those modifications could be applied by default.

E.g.: the OS of idgraf is currently pretty old. If requested, it could be upgraded, to provide CUDA 5 by default for instance, instead of the current CUDA 4 (as of 2013-06).

I just deployed the default OS of a machine, and I cannot ssh to the machine with my user login

Default environments of machines have restrictions regarding user logins: only root and the oar user can connect via ssh (as required by the oarsh mechanism). If you deploy a default environment, you must comment out the last line in /etc/security/access.conf:

-:ALL EXCEPT root oar:ALL

If in doubt, you can actually comment out all the lines in the file.

Then you should be able to ssh to the machine using any valid user credentials.
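Commenting the line out can be done in one command on the deployed node (where you are root); this is a sketch using sed on the exact line quoted above:

```shell
# Prefix the restriction line with '#' in place (run as root on the node)
sed -i 's/^-:ALL EXCEPT root oar:ALL/# &/' /etc/security/access.conf
```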

I lost my deploy job, can I get my system back?

If your deploy job ended, but the machine(s) you were using are still available (no other deploy job ran in-between), you can create a new deploy job and just reboot the machine into your system instead of deploying it again, using kareboot3.

For instance, let's say you deployed the jessie-x64-nfs environment on grimage-9:

kadeploy3 -m grimage-9.grenoble.grid5000.fr -e jessie-x64-nfs -u root -k

If you lose your job on grimage-9, but get a new one right after the node has rebooted back to the default environment (you will notice a short period in the Absent state), you can just reboot the node into your system (on partition 3), without needing to deploy again:

kareboot3 -m grimage-9.grenoble.grid5000.fr -r recorded_env -e jessie-x64-nfs -u root -p 3

Any other question?

Please visit the Grid'5000 website: http://www.grid5000.fr

Or see below the technical contact section.

Resource usage visualization tools

Two tools are available to see how resources are or will be used:

chandler

Chandler is a command line tool, to be run on digitalis. It gives a view of the current usage of the machines.

pneyron@digitalis:~$ chandler

4 jobs, 92 resources, 60 used
         grimage-1 	TTTTTTTT grimage-2 	TTTTTTTT grimage-3 	
TTTTTTTT grimage-4 	TTTTTTTT grimage-5 	         grimage-6 	
         grimage-7 	JJJJJJJJ grimage-8 	JJJJJJJJ grimage-9 	
         grimage-10 	TTTTTTTTTTTT idgraf 	

 =Free  =Standby J=Exclusive job T=Timesharing job S=Suspected A=Absent D=Dead

grimage-2.grenoble.grid5000.fr
  [1101] eamat (shared)

grimage-3.grenoble.grid5000.fr
  [1101] eamat (shared)

grimage-4.grenoble.grid5000.fr
  [1101] eamat (shared)

grimage-5.grenoble.grid5000.fr
  [1101] eamat (shared)

grimage-8.grenoble.grid5000.fr
  [1115] pneyron (reboot)

grimage-9.grenoble.grid5000.fr
  [1115] pneyron (reboot)

idgraf.grenoble.grid5000.fr
  [1113] jvlima (shared)
  [1114] pneyron (shared)

Gantt diagram of usage

The OAR Drawgantt diagram gives a view of the past, current and future usage of the machines.

(To see only one of the machines, set the filter parameter to one of the values shown in the select box, e.g. https://intranet.grid5000.fr/oar/grenoble/digitalis/drawgantt-svg/?filter=grimage)

Other OAR tools

All OAR commands are available, see OAR's documentation.

  • oarstat: list current jobs
  • oarnodes: list the resources with their properties
  • etc.
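For instance (the job id is an example):

```shell
oarstat                # list all current jobs
oarstat -f -j 6028     # show the full details of job 6028
oarnodes               # list the resources and their properties
```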

Getting help / technical questions

Mailing lists

Dedicated list

The digitalis@lists.grid5000.fr mailing list is dedicated to communication about the locally managed machines. You should be subscribed to that list as soon as you belong to the Digitalis group (Grid'5000 account). Please make sure this is the case in your affiliation in the Grid'5000 user management system.

You'll get information about the platform via this list, but this is also a community list, to which you can send questions or answer to questions of other users.

The digitalis mailing list is the preferred medium for any question regarding the platform.

Grid'5000 lists

Grid'5000 provides several mailing lists which every Grid'5000 user automatically receives (e.g. users@lists.grid5000.fr, platform@lists.grid5000.fr). Since the local machines benefit from several global services of Grid'5000, you should keep an eye on the information sent to those lists, to be aware of potential exceptional maintenance for instance.

Also be aware that Thursday is the maintenance day on Grid'5000. Regular maintenance operations are scheduled then, which may for instance impact the NFS service, and affect Grid'5000 machines as well as Digitalis local machines.

Please do not use the users@lists.grid5000.fr list for issues related to the Digitalis local machines, since the Grid'5000 staff is not in charge of those machines.

Grid'5000 Platform Events

Please also keep the Grid'5000 platform events page in your bookmarks. It lists the future events which will impact the platform. You can also subscribe to the RSS feed.
