Using the Robotarium Cluster: Part 1

Introduction

Over the summer and early fall of last year (2014) I got the opportunity to use a cluster to run some of the experiments for my master’s thesis. The experiments revolved around collecting performance metrics for various benchmarks as part of a comparison between two different multi-threaded, platform-agnostic programming languages. The purpose of this post, though, is not to discuss my master’s thesis, but to write about how to use a high-performance cluster. Specifically, I will walk through a few examples and use cases – these will appear as a series of posts over the next few months or so. The examples make use of generic software packages and tools, but are specific to the Robotarium cluster (the website is currently only accessible from within Heriot-Watt University; a solution to this is being worked on), which is available to students and staff of the EPSRC Centre for Doctoral Training in Robotics and Autonomous Systems. Details on how to access the cluster can be found on its website, or you can send me an email at hv15 ÁT hw.ac.uk.

For the uninitiated, a cluster is a collection of computers connected by a network. In cluster computing, each computer (called a node) works in isolation on a task of a particular application that is executing over some or all of the compute nodes in the cluster. This can take two forms: either each instance of the application works on a completely different dataset – in essence accelerating the overall computation over multiple datasets – or the application distributes a single dataset across all of its instances, each of which computes over a unique index range of it. Hardware resources across the cluster can be used by one or more applications; this is managed through some form of batch/resource management system that ensures, among other things, that resources are not oversubscribed.

The importance of cluster computing is that, with relatively inexpensive hardware, one can build a system that competes in both performance and cost with purpose-built supercomputers like those created by Seymour Cray (seriously, read up about this guy – he pretty much created the supercomputer and dominated super-computing for decades). This is why practically every system on the TOP500 list of the most powerful computing systems is a cluster.

That’s the basic outline of what makes a cluster useful. Now on to a real-world cluster: the Robotarium cluster. Before we get going on the examples part of this post, some background information about the cluster is in order. It is made up of 10 compute nodes and one head node, as well as eight dedicated Intel Xeon Phi co-processors. The head node acts as the main entry point to the cluster, from where users launch their applications. The compute nodes have between 16 and 64 CPU cores and between 512GB and 1024GB of RAM; eight nodes additionally have one NVIDIA K20 GPU each. The cluster was built with funds made available by the EPSRC Centre for Doctoral Training in Robotics and Autonomous Systems.

For this post, I will walk through the setup needed to run an application on the cluster using a simple example application. Later posts will discuss using technologies such as MPI and allocating non-CPU resources such as GPUs and the Xeon Phi co-processors.

Basics

For a user, using a cluster – even one as small as the Robotarium cluster – can seem daunting. Spinning up several compute nodes and having your application execute on them doesn’t need to be difficult though. The trick to effectively using a cluster starts with the right application – that is to say, an application implemented such that it can either compute over many different datasets independently or distribute a single dataset across the nodes and work on parts of it. For example, for my master’s thesis work (I know I said I wasn’t going to talk about it, but oh well) I ran the same application on each compute node with a different dataset – in essence computing over 10 datasets (one for each compute node) at the same time.

The next steps involve setting up the environment for the application, such as making sure software dependencies are met, and allocating resources. The next two sections will introduce the concept of software modules and the batch management system used on the cluster. I have created a small application and will use this for the examples.

Setting up the environment

It’s impractical for both users and administrators to provide software resources on a per-user/per-application basis. If one considers that many applications need the same or similar software resources, like libraries, providing them in a modular fashion seems like a good idea. The Environment Modules Project came up with a solution that does just that, by providing software resources as a collection of modules, each of which provides a specific version of a piece of software, e.g. gcc/5.2.0 or openmpi/gcc/64/1.8.5. This facility is exposed through the module application, which on many clusters is the common way to gain access to software packages and libraries that are not installed on the nodes by default.

To view a complete list of modules available on the cluster, we can call module with the avail command:

$ module avail
/cm/shared/modulefiles:
acml/gcc/64/5.3.1
acml/gcc/fma4/5.3.1
acml/gcc/mp/64/5.3.1
acml/gcc/mp/fma4/5.3.1
acml/gcc-int64/64/5.3.1
acml/gcc-int64/fma4/5.3.1
acml/gcc-int64/mp/64/5.3.1
acml/gcc-int64/mp/fma4/5.3.1
... // <1>
  1. There are a lot more modules available, but these have been omitted here for brevity

The first line in the module avail output shows where the module files have been read from; the available modules are listed directly below it. For example, the cluster has the AMD Core Math Library version 5.3.1 available (acml/gcc/64/5.3.1).
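
Before loading a module it can be useful to check what it actually provides and what it would do to your environment; the whatis and show sub-commands do just that (I’ve omitted their output here, as it depends entirely on how the module file was written):

$ module whatis acml/gcc/64/5.3.1 // <1>
$ module show acml/gcc/64/5.3.1 // <2>
  1. Prints a one-line description of the module.
  2. Prints the changes the module makes to the environment (PATH, LD_LIBRARY_PATH and so on) when it is loaded.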

To show what modules we have already loaded we can use the list command:

$ module list
Currently Loaded Modulefiles:
gcc/5.2.0
slurm/14.03.0

As you can see, I only have two modules loaded when I log in. This is done through my .bashrc file, where I call module with the load command:

My .bashrc file

## stuff omitted

## load modules
module purge // <1>
module load gcc
module load slurm

## more stuff omitted
  1. See further below for what this does.

You’ll notice in my .bashrc file that I don’t explicitly give the full module name for either gcc or slurm. The module command is fairly smart and will automatically select the latest version of the given module. For example, there are two versions of gcc available on the cluster, 4.8.4 and 5.2.0; without specifying which version I want, module loaded version 5.2.0 for me (see my list from above).

The module application isn’t, however, smart enough to stop you from loading different versions of the same module at the same time, e.g. gcc/4.8.4 and gcc/5.2.0. This can lead to bizarre issues, such as linking errors when compiling. It is therefore good practice to always give the whole module name when loading.

If one does get into this situation, or just wants to remove a loaded module, we can use the unload command:

$ module unload gcc/5.2.0
$ module list
Currently Loaded Modulefiles:
slurm/14.03.0

Or if we want to remove all loaded modules, we can use the purge command:

$ module purge
$ module list // <1>
  1. There is nothing to print back.
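
If the issue is simply that the wrong version of a module is loaded, there is also the swap (also called switch) sub-command, which unloads one version and loads another in a single step – a quick sketch using the two gcc versions mentioned earlier:

$ module swap gcc/4.8.4 gcc/5.2.0 // <1>
  1. Unloads gcc/4.8.4 and loads gcc/5.2.0 in one go.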

Allocating and executing

On the Robotarium cluster, allocating resources and executing applications is handled by the SLURM (Simple Linux Utility for Resource Management) batch system. SLURM organises the nodes and their resources into partitions (queues), from which a user can request resources for their application. This is done through two of SLURM’s user space tools: sbatch and srun. The first reads in configuration/settings from both the command line and a batch script; the batch script is basically a wrapper around the application and is used to set up the environment and anything else a user wants to happen before or after their application runs. After reading in the batch script, sbatch submits it to the batch system for later execution. srun is similar to sbatch, except that it launches the application immediately and provides the user with its stdout and stderr output in real time. An example of a batch script for SLURM is given below (this is the template that I use most of the time):

batchscript.sh

#!/bin/bash

#SBATCH --partition=<name of partition>
#SBATCH --nodes <number of nodes wanted>
#SBATCH --mail-user=<email address to notify>
#SBATCH --mail-type=<when to notify user: BEGIN, END, FAIL, ALL>

# do some stuff to setup the environment
module purge // <1>
module load gcc/5.2.0

# extra stuff
#blah
#blah

# execute application (read in arguments from command line)
./application $@

# exit
exit 0 // <2>
  1. It is always a good idea to clear the modules before executing one’s application; this makes sure that the environment is clean.
  2. sbatch uses the status returned from the batch script to check that the application has completed. If it is non-zero, sbatch reports that the application has failed – if you use email notifications you’ll get some false positives.

The batch script file is only an example; by that I mean that one can use any scripting language, like Python or Ruby. If you really wanted to, you could also use JavaScript… so long as the appropriate interpreter is available via the shebang. At any rate, the purpose of the script is to encode the resource requirements of the application. In the batch script this is defined through the #SBATCH comment directive, which is followed by a single command-line argument. In our example, the first line indicates which partition we wish to use, the second specifies how many nodes we want, and the last two lines tell SLURM that we want to be notified via email about certain events. More details can be found in the man pages for SLURM. The next few lines make use of the module application to set up the environment for the application – in this case we load gcc.
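
The directives in the template above are only a small subset of what sbatch understands; a few others that I find worth knowing about are sketched below (the values are just placeholders – see the sbatch man page for the full list):

#SBATCH --job-name=<name shown in the queue>
#SBATCH --time=<wall-clock limit, e.g. 01:30:00> // <1>
#SBATCH --output=<file for stdout/stderr, %j expands to the job number>
#SBATCH --ntasks=<number of tasks (processes) to launch>
  1. The job is killed if it runs longer than this, so be generous.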

Let’s start with a basic example application – the following C code prints out the hostname of the computer it runs on:

hostname.c

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/utsname.h>

int main() {
    struct utsname * system; // <1>
    int error;

    system = malloc(sizeof(struct utsname));
    error = uname(system);

    if(error)
    {
        perror("Whoops, something went wrong: ");
        return EXIT_FAILURE;
    }

    printf("%s says woof :D\n", system->nodename);
    free(system);

    return EXIT_SUCCESS;
}
  1. Structure contains various bits of information, see http://linux.die.net/man/2/uname for more information

We can compile hostname.c by calling GCC: gcc -std=c99 -Wall -o hostname.out hostname.c. If we run hostname.out we should get an output that resembles foobaz says woof :D.

If we use srun to execute hostname.out on the cluster we get:

$ srun ./hostname.out
gpu01 says woof :D

By using srun without any command line arguments, we automatically get one node allocated in the default partition. In this instance, we got the node gpu01. If we run the same command as above but with --nodes 2, we get the following:

$ srun --nodes 2 ./hostname.out
gpu01 says woof :D
gpu02 says woof :D

Now we have two nodes executing hostname.out independently, gpu01 and gpu02. We can continue doing this until we hit a resource limit:

$ srun --nodes 42 ./hostname.out // <1>
srun: error: Unable to allocate resources: Node count specification
invalid
  1. we don’t actually have 42 nodes, only 10 :-)

What about running hostname.out as a batch job? Well, for that we use sbatch, which will execute hostname.out in the background. For this, though, we need to create a batch script – let’s use the one given in the example batch script above with a few modifications:

hostname.sh

#!/bin/bash

#SBATCH --nodes 2

# do some stuff to setup the environment
module purge
module load gcc/5.2.0

# execute application (read in arguments from command line)
./hostname.out $@

# exit
exit 0

An advantage to a batch script is that we can execute the application by calling the batch script, meaning that environment settings and other setup is self-contained within the script. With our batch script, we can call sbatch just like we would srun:

$ sbatch ./hostname.sh
Submitted batch job 10010

Two things will happen:

  1. The batch script is queued and waits until all resource requirements are fulfilled before being executed.
  2. The stdout and stderr streams will be written to the file slurm-<JOB NUMBER>.out.
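
While the job is sitting in the queue (or running), we can keep an eye on it with two more of SLURM’s user space tools – a quick sketch using the job number from above:

$ squeue -u $USER // <1>
$ scancel 10010 // <2>
  1. Lists your jobs along with their state (pending or running), how long they have been running and which nodes they were given.
  2. Removes the job from the queue, or kills it if it is already running – handy if you spot a mistake after submitting.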

If we open up the slurm-10010.out file, we get the following:

$ cat slurm-10010.out
gpu01 says woof :D
gpu02 says woof :D

The result is identical to when we used srun above.
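
Once the job has finished it will no longer show up in squeue; assuming job accounting is enabled on the cluster, sacct can still tell us what happened to it:

$ sacct -j 10010 // <1>
  1. Shows the job’s final state (COMPLETED, FAILED, ...) and its exit code – useful if you didn’t ask for email notifications.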

End of Part 1

We have now reached a good point to end this blog post. In the next post on Using the Robotarium Cluster, I’ll expand hostname.c to make use of MPI and will probably give a walk-through on how to allocate the GPUs. Later blog posts will look at interfacing with the Xeon Phis.

Front Image: Kindly provided by the Edinburgh Centre for Robotics.