Compara [Meyer Lab's high-performance cluster]
1 What is compara ?

Compara is a computer cluster for research in genomics, hence its name. The cluster:

  • Overview:
    • normal nodes: 16 x 72 cores, 386 GB RAM, 440 GB /scratch, 60 GB /tmp
    • big nodes: 2 x 80 cores, 1.5 TB RAM, 1.3 TB /scratch, 390 GB /tmp
  • runs Linux as its operating system
  • uses Torque/PBS to submit and schedule jobs
2 How do I get started ?

    Before you can actually submit any jobs to compara, you first need to check a few things.

  • Make sure you have an MDC username/login and that you are a member of the Meyer Lab.
  • You can access the cluster directly from any desktop PC on the Campus-LAN and WiFi (mdc-intern, mdcguest, eduroam). Just connect via SSH (Secure SHell) to the login node meyerc-login01 with your username and password.
  • On UNIX-like systems (Linux, macOS), type: ssh username@meyerc-login01.mdc-berlin.net (authenticate with your MDC password).
  • On Windows systems, download and use an SSH client such as PuTTY or MobaXterm.
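    If you connect often, you can store the connection details in ~/.ssh/config on your own machine. A minimal sketch (the host alias is arbitrary; replace myself with your MDC username):

    Host compara
        HostName meyerc-login01.mdc-berlin.net
        User myself

    After that, typing ssh compara is enough to reach the login node.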
3 How do I submit jobs ?

    The main idea behind having a cluster and a job queuing system is that jobs are distributed across the nodes of the cluster, which is far more efficient than running jobs interactively on individual machines. This is the main reason why interactive jobs are not allowed.

    To submit jobs, you first need to log in to the head node of compara, for example using ssh -X myself@meyerc-login01, replacing myself with your own login id. On the head node, you may edit files, for example your job submission scripts, but other than that interactive jobs are not allowed and will be killed.

    For submitting a job, there are two options:

  • option 1 (preferred): writing a script and submitting this with qsub using:
    qsub my_script.sh
  • option 2: submitting your job's command line (my_command_line) directly with qsub using:
    qsub [qsub_options] my_command_line
  • Always submit a test job before submitting any number of real jobs.
  • Option 1 has many advantages. It allows you to move all the qsub_options, which would otherwise have to be typed directly on the command line, into the script. You can thus re-run jobs easily and also keep a full record of how each job was submitted. A small example of both options is shown below.
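    For example, a quick test could be submitted either way (the queue name default is only illustrative here; queues are described in section 5):

    # option 1: everything recorded in the script
    qsub my_script.sh

    # option 2: pipe a one-off command directly to qsub, with options on the command line
    echo "my_command_line" | qsub -q default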

    4 Basic commands & qsub ?

    The batch system offers the following basic commands to accomplish common user tasks on the cluster:

    Task                                      Command
    submit jobs                               qsub, qresub, qrsh, qlogin, qsh, qmake, qtcsh
    check job status                          qstat
    modify jobs                               qalter, qhold, qrls
    check job accounting after job end        qacct
    check cluster messages after job fails    qmesg
    delete jobs                               qdel
    display cluster state                     qstat, qhost, qselect, qquota
    display node state                        qwho, qhost
    display cluster configuration             qconf

    qsub is the command used to submit jobs to the cluster. Here is a simple example to make sure things are working:
    echo date | qsub

    The qsub command has a number of options which you should use to specify how your jobs should be run. The manual page for qsub gives you the full list. The most common qsub_options are:

    -d path
    Defines the working directory path to be used for the job executable. Make sure you specify the full path.
    -o filename
    Defines the path to be used for the standard output stream of the job. Make sure you specify the full path.
    -e filename
    Defines the path to be used for the standard error stream of the job. Make sure you specify the full path.
    -q queuename
    Sends the job to the specified queue.
    -l resources_list
    Specifies the resources to be allocated to this job, for example mem and cput; see all options and examples in man pbs_resources.
    -t [int]-[int]
    Starts a job array. Sends multiple copies of the job to the cluster, each with a different task id in the range [int]-[int]. This is a good way of running the same job many times (e.g. with several different parameter settings). In the submitted script, the task id can be accessed through the environment variable PBS_ARRAYID, so the script can check its own task id and run the job with the appropriate settings.
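    As an illustration, several of these options can be combined in a single call; the directory, file names, queue and memory value below are placeholders only:

    qsub -d /home/myself/my_project -o /home/myself/my_project/job.out -e /home/myself/my_project/job.err -q default -l mem=8gb my_script.sh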

    When using a script my_script.sh to submit jobs, all of the above qsub options can be moved to the script by placing lines starting with #PBS [option] after the shebang line. Example:
    #!/bin/bash
    #PBS -d /home/myself/my_directory/
    #PBS -q default
    #PBS -l mem=10G,cput=60
    my_command_line
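    For job arrays (the -t option above), a minimal sketch of a submission script could look like this; the input file naming scheme is only an assumption for illustration:

    #!/bin/bash
    #PBS -d /home/myself/my_directory/
    #PBS -q default
    #PBS -l mem=8gb
    #PBS -t 1-10
    # each of the 10 copies of the job picks its own input via its task id
    my_command_line input_${PBS_ARRAYID}.txt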

    5 Torque/PBS setup & storage?

    First, it is a good idea to always specify a job queue. This helps the scheduler run jobs more efficiently and ensures every job gets the correct time and memory resources. There are two sets of queues, those with default priority and those with high priority. Unless you have a compelling reason to use the high-priority queues, you should use the default ones. Please check with the designated IT people first if you want to use the high-priority queues.

  • /data/meyer is the share stored on fs02, common with /data/meyer on your workstations and the max-cluster, and available (via NFS) on all meyerc* nodes. Given the history of this share, we recommend avoiding any computations that involve it (try to limit its use to copying data from and to it).
  • /data/basecalls is mounted over NFS from fs02 and can only be used read-only.
  • /data/meyerc is a filesystem stored on meyerc-fs01 and is available only on meyerc* nodes. As it was designed just for storing data, we recommend avoiding using it in computations, except perhaps to read from it while writing the output to /scratch.
  • /home/{username} is shared across all meyerc* nodes from meyerc-fs01; it is available only inside this cluster and is not the same /home as on the max-cluster.
  • /scratch is a local SSD and is the place where all computation I/O should go; before starting to use it, please be aware of its size and also that it is not shared across nodes (see the sketch below).
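    A common pattern that follows these recommendations is to stage data through /scratch inside the job script; the paths and file names below are placeholders only:

    #!/bin/bash
    #PBS -q default
    #PBS -l mem=8gb
    # copy the input from shared storage to the node-local scratch disk
    WORKDIR=/scratch/${USER}_${PBS_JOBID}
    mkdir -p "$WORKDIR"
    cp /data/meyer/my_project/input.fastq "$WORKDIR/"
    cd "$WORKDIR"
    # run the computation against local files only
    my_command_line input.fastq > output.txt
    # copy the results back to shared storage and clean up
    cp output.txt /data/meyer/my_project/
    rm -rf "$WORKDIR"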
  • Queue and resource configuration:

  • Create and define queue default
    • set queue default queue_type = Execution
    • set queue default resources_max.mem = 382gb
    • set queue default resources_max.nodes = 16
    • set queue default resources_max.walltime = 24:00:00
    • set queue default resources_max.nodect = 16
    • set queue default resources_default.nodes = 1
    • set queue default resources_default.walltime = 24:00:00
    • set queue default resources_default.ncpus = 1
    • set queue default resources_default.mem = 8gb
    • set queue default resources_default.neednodes = small
    • set queue default resources_default.nodect = 1
    • set queue default acl_group_enable = True
    • set queue default acl_groups = AG_Meyer
    • set queue default acl_groups += bimsb_itsupport
    • set queue default acl_group_sloppy = True
    • set queue default enabled = True
    • set queue default started = True
  • Create and define queue bigmem
    • set queue bigmem queue_type = Execution
    • set queue bigmem resources_max.mem = 1530gb
    • set queue bigmem resources_max.nodect = 2
    • set queue bigmem resources_max.walltime = 96:00:00
    • set queue bigmem resources_default.nodect = 1
    • set queue bigmem resources_default.walltime = 96:00:00
    • set queue bigmem resources_default.ncpus = 1
    • set queue bigmem resources_default.mem = 16gb
    • set queue bigmem resources_default.neednodes = big
    • set queue bigmem acl_group_enable = True
    • set queue bigmem acl_groups = AG_Meyer
    • set queue bigmem acl_groups += bimsb_itsupport
    • set queue bigmem acl_group_sloppy = True
    • set queue bigmem enabled = True
    • set queue bigmem started = True
  • Create and define queue longtime
    • set queue longtime queue_type = Execution
    • set queue longtime resources_max.mem = 382gb
    • set queue longtime resources_max.nodect = 16
    • set queue longtime resources_max.walltime = 9999:00:00
    • set queue longtime resources_default.nodect = 1
    • set queue longtime resources_default.walltime = 336:00:00
    • set queue longtime resources_default.ncpus = 1
    • set queue longtime resources_default.mem = 8gb
    • set queue longtime resources_default.neednodes = small
    • set queue longtime acl_group_enable = True
    • set queue longtime acl_groups = AG_Meyer
    • set queue longtime acl_groups += bimsb_itsupport
    • set queue longtime acl_group_sloppy = True
    • set queue longtime disallowed_types = interactive
    • set queue longtime enabled = True
    • set queue longtime started = True
  • Create and define queue longbigmem
    • set queue longbigmem queue_type = Execution
    • set queue longbigmem resources_max.mem = 1530gb
    • set queue longbigmem resources_max.nodect = 2
    • set queue longbigmem resources_max.walltime = 9999:00:00
    • set queue longbigmem resources_default.nodect = 1
    • set queue longbigmem resources_default.walltime = 336:00:00
    • set queue longbigmem resources_default.ncpus = 1
    • set queue longbigmem resources_default.mem = 16gb
    • set queue longbigmem resources_default.neednodes = big
    • set queue longbigmem acl_group_enable = True
    • set queue longbigmem acl_groups = AG_Meyer
    • set queue longbigmem acl_groups += bimsb_itsupport
    • set queue longbigmem acl_group_sloppy = True
    • set queue longbigmem disallowed_types = interactive
    • set queue longbigmem enabled = True
    • set queue longbigmem started = True
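    In practice, you pick the queue whose walltime and memory ceiling fits your job; the requested values below are purely illustrative:

    # default: up to 382gb memory, 24h walltime
    qsub -q default -l mem=32gb,walltime=12:00:00 my_script.sh
    # bigmem: up to 1530gb memory, 96h walltime
    qsub -q bigmem -l mem=800gb,walltime=48:00:00 my_script.sh
    # longtime / longbigmem: for jobs that need to run longer than that
    qsub -q longtime -l mem=32gb,walltime=200:00:00 my_script.sh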
6 What are my jobs doing ?

    You can use qstat to check the status of your jobs. You can also use the -m flag of qsub to get an email when your job is finished. By default, only active jobs are shown, i.e. those that are still running. Check the manual page of qstat for how to get more detailed information on your jobs. Some useful flags are -q to see the global status of each queue, -f for details of each job, and -n for node information of each job.

    If you realize you want to stop one of your jobs, you should first get the job's ID using qstat. You can then kill the job using qdel job_id. Alternatively, qdel all will kill all of your own currently running jobs. Have a look at the manual page of qdel for more details.
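    A typical monitoring session might look like this (the job id is made up):

    qstat -u $USER     # list your own jobs and their states
    qstat -f 12345     # full details of job 12345
    qstat -n 12345     # node(s) allocated to job 12345
    qdel 12345         # kill job 12345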

    7 Shared software: GUIX
  • A shared Guix installation is available and is managed on the meyerc-guix server. While personal profiles can be managed from any node, the shared profiles can be adjusted only from the meyerc-guix server and only by members of compara/meyerc_admins.
  • A shared profile can be created as shown in the sketch at the end of this section.
  • The custom guix-bimsb repository is not enabled by default, but you can use it by running git clone https://github.com/BIMSBbioinfo/guix-bimsb.git inside the folder /gnu/custom_repos/ and then setting the GUIX_PACKAGE_PATH=/gnu/custom_repos/.../ variable (see http://guix.mdc-berlin.de/documentation.html#sec-7-2).
  • If a piece of software is not available via Guix, or it is a precompiled binary, you can install it (drop it) in the shared location /usr/loca/shared/; again, this can be done only from the meyerc-guix server and only by members of meyerc_admins and the compara admins.
  • /usr/loca/shared/bin/ is already available in default $PATH for all users on all computing nodes
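    As a sketch (the profile path and package names below are only examples, not an agreed convention), a shared profile could be created on meyerc-guix and then activated from a job script like this:

    # on meyerc-guix, as a member of meyerc_admins: install packages into a shared profile
    guix package -p /gnu/shared_profiles/genomics -i samtools bwa

    # on a compute node or in a job script: activate that profile
    GUIX_PROFILE=/gnu/shared_profiles/genomics
    . "$GUIX_PROFILE/etc/profile"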
8 Houston, I think we have a problem ...

    Things can go wrong for a number of reasons. Here is more information on whom to contact when.

  • If you have problems making sense of any of the above, read the corresponding manual pages or the HPC User Guide provided by the HPC team.
  • Please note that we rely on all users of compara not to abuse the system. If you create a mess which could easily have been avoided by using test jobs and an appropriate combination of qsub flags, you risk being removed from the list of compara users.
  • If you have suggestions on how to improve the performance of compara in general, or you encounter some other issue, please email Dan.Munteanu@mdc-berlin.de, cc'ing Irmtraud.
Managed by the Meyer Lab at BIMSB.

    Updated on 29/10/2020