Changes between Version 13 and Version 14 of ComputeResources/UMCGCluster


Timestamp: Jan 14, 2013 2:57:57 PM
Author: Pieter Neerincx
== Description ==
The GCC cluster @ UMCG is a 480-core PBS cluster described in detail here: [http://wiki.gcc.rug.nl/wiki/GCCCluster GCC Cluster]

The UMCG cluster is composed of:
 * 1 head node
 * 10 compute nodes, each with:
   * 48 cores
   * 256GB RAM
   * 2.3TB of local storage
   * 10Gb network connection to storage
 * 2 PB GPFS storage (only 1.1PB mounted at the time of writing)

The 10 nodes are dedicated to the GCC group at UMCG. As GoNL is the most compute-intensive project at GCC, most of the cluster can be used for it. The storage is shared by different groups in Groningen, but there is currently no "hard limit" on how much space GoNL can use; this will of course only work as long as there is sufficient space for everyone.
== Access ==
Access to the UMCG cluster is done via:
 * SFTP via sftp.gcc.rug.nl - data access only; see the [wiki:DataManagement/SftpServer SFTP page] for details
 * SSH via cluster.gcc.rug.nl - to submit and monitor jobs. Connections from outside the UMCG/RUG network require a double hop via proxy.gcc.rug.nl, which can be automated as described in [http://wiki.gcc.rug.nl/wiki/TransparentMultiHopSSH TransparentMultiHopSSH] and sketched below.
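A minimal sketch of such a double-hop entry in ~/.ssh/config, assuming an OpenSSH client recent enough to support the -W option (the [http://wiki.gcc.rug.nl/wiki/TransparentMultiHopSSH TransparentMultiHopSSH] page has the authoritative recipe):

{{{
Host cluster.gcc.rug.nl
    # Tunnel through proxy.gcc.rug.nl to reach the cluster from outside the UMCG/RUG network
    ProxyCommand ssh -W %h:%p proxy.gcc.rug.nl
}}}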
Access to the default ''gcc'' queue is available to all GoNL members. If you do not have an account yet, you can request one via Morris, who keeps the list of all users with full data access.
== Usage ==

 * Although the local storage is periodically cleaned, if you store large files on a node while running a job you should clean them up afterwards; small temporary files are fine. See the sketch after this list.
 * '''Data Management''': Please read the [wiki:DataManagement Data Management] section of this wiki thoroughly and respect the structure and conventions described there when using data outside your home directory.
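A minimal sketch of such a clean-up, assuming you create your own scratch directory on a node's local disk (the path is purely illustrative; prefer the $TMPDIR mechanism described under Local data below when it suffices):

{{{
#!/bin/bash
# Illustrative only: create a scratch directory on the node's local disk and
# remove it when the job exits, even if the job fails.
SCRATCH=/local/scratch/${USER}/${PBS_JOBID}   # hypothetical location, not an official path
mkdir -p "${SCRATCH}"
trap 'rm -rf "${SCRATCH}"' EXIT

# ... write large intermediate files to ${SCRATCH} here ...
}}}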
== Scheduler ==
cluster.gcc.rug.nl uses the [http://doesciencegrid.org/public/pbs/ Portable Batch System (PBS)] as its scheduler. You can find the full documentation in the [http://doesciencegrid.org/public/pbs/ PBS guide]. A few basic commands and tips are listed below, followed by an example session:

 * qstat -u username
   * Shows a list of your jobs along with information and status.
 * showq [-u username]
   * Shows the list of all jobs running on the cluster (with the -u flag, only that user's jobs) along with information and status.
 * checkjob jobid
   * Shows in-depth information about a specific job.
 * qsub jobScript
   * Submits a new job to the cluster. Note that it is important to submit your jobs with the appropriate options; see the qsub options section below for a quick overview of the common ones.
 * qdel jobid
   * Removes a job from the queue, killing the process if it has already started.
   * "qdel all" can be used to purge all of your jobs.
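For illustration, a typical session might look like this (the script name, job ID and username are made up):

{{{
$ qsub myJobScript.sh
12345.cluster.gcc.rug.nl
$ qstat -u lfrancioli
$ checkjob 12345
$ qdel 12345
}}}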
== Available queues ==
In order to quickly test jobs, you are allowed to run them directly on cluster.gcc.rug.nl outside the scheduler. Please think twice before you hit enter though: if you crash cluster.gcc.rug.nl, others can no longer submit or monitor their jobs, which is pretty annoying. On the other hand it is not a disaster, as the scheduler and execution daemons run on physically different servers and hence are not affected by a crash of cluster.gcc.rug.nl.

To test how your jobs perform on an execution node and get an idea of the typical resource requirements for your analysis, you should submit a few jobs to the test queues first. The test queues run on a dedicated execution node, so if your jobs accidentally make that server run out of disk space, out of memory or do other nasty things, this will not affect the production queues and nodes.

Once you have tested your job scripts and are sure they behave nicely and perform well, you can submit jobs to the production queue named ''gcc''.
||**Queue**||**Job type**||**Limits**||
||test-short||debugging||10 minutes max. walltime per job; limited to a single test node / 48 cores||
||test-long||debugging||max. 4 jobs running simultaneously per user; limited to half the test node / 24 cores||
||gcc||production - default priority||none||
||gaf||production - high priority||only available to users from the gaf group||
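For example, a job script could first be tried on a test queue and then submitted to production (the script name is just an illustration):

{{{
qsub -q test-short myJobScript.sh   # quick functional test, max. 10 minutes walltime
qsub -q gcc myJobScript.sh          # production run once the script behaves well
}}}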
== Nodes ==

It is not allowed to run tasks directly on the nodes; always use the scheduler. If you want to test, use the test queues.
== Local data ==

If for some reason it is necessary to use local disk space instead of the GPFS, then you need to request local disk space like this:
{{{
#PBS -l file=10mb
}}}
A private temp folder will be created and removed automatically after your script has finished. You can access this folder via the environment variable $TMPDIR (see the example below). Please make sure you do not use more disk space than requested.

Files created on local disk without using this system can be deleted without notice.
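For example, a job script could stage data through the requested local scratch space like this (the file names and the analysis command are hypothetical; the home directory path is only an example):

{{{
#!/bin/bash
#PBS -l file=10gb

# $TMPDIR points to the private temp folder created for this job
cp /target/gpfs2/gcc/home/lfrancioli/input.dat "$TMPDIR/"
my_analysis "$TMPDIR/input.dat" > "$TMPDIR/result.txt"      # my_analysis is a hypothetical command
cp "$TMPDIR/result.txt" /target/gpfs2/gcc/home/lfrancioli/  # copy results back before the job ends
}}}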
=== qsub options ===
Jobs submitted via PBS qsub can specify a number of options to claim resources, report status, etc. These options can be specified either on the qsub command line or in your job script. The latter is usually preferred, as all information about the job, including memory requirements etc., stays with the script. Below is an example header with some commonly used options, followed by a list of common flags and their meaning, and finally an equivalent command-line invocation.

'''Example script header:'''

{{{
#!/bin/bash
#PBS -N JobName
#PBS -q gcc
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=12:00:00
#PBS -o /target/gpfs2/gcc/home/lfrancioli/output.log
#PBS -e /target/gpfs2/gcc/home/lfrancioli/error.log

# Here come your bash script commands

echo "Hello World!"
}}}
'''Commonly used options:'''

 * -q queueName
   * Selects which queue the job should be put in; the available queues are listed in the Available queues section above.
 * -N jobName
   * Sets the job name.
 * -l nodes=X:ppn=Y
   * Requests X nodes and Y cores per node.
 * -l mem=Xgb
   * Requests X GB of RAM.
 * -l walltime=12:00:00
   * Sets the walltime to the specified value (here 12 hrs). This flag should always be set.
 * -j oe
   * Redirects all error output to standard output.
 * -o outputLog
   * Redirects the standard output to the desired file. Note that using '~' in the path for your home directory does not work.
   * Note that the standard output is first written on the local node and only copied to the desired file once the job terminates (regardless of the reason for the job termination).
 * -e errorLog
   * Redirects the error output to the desired file. Note that using '~' in the path for your home directory does not work.
   * Note that the error output is first written on the local node and only copied to the desired file once the job terminates (regardless of the reason for the job termination).
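As mentioned at the start of this section, the same options can also be passed on the qsub command line instead of in the script header. The following sketch is equivalent to the example header above (the script name is illustrative):

{{{
qsub -N JobName -q gcc -l nodes=1:ppn=1,mem=4gb,walltime=12:00:00 \
     -o /target/gpfs2/gcc/home/lfrancioli/output.log \
     -e /target/gpfs2/gcc/home/lfrancioli/error.log \
     myJobScript.sh
}}}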