Creating a PBS/MPI cluster for bioinformatics – Part 3

This is the third and last part of this blog series. In this post we will install Structure (Pritchard Lab) and Torque PBS. We will configure a simple run in Structure using the two machines in our cluster.

Installing Structure

Installing Structure is very simple. Download the latest version from the Pritchard Lab page, decompress it, and either move the executables to a folder in your $PATH or use symlinks. Here I'm using /usr/local/bin, but to keep things in order I renamed the console folder (extracted from structure_linux_console.tar.gz) to structure-console-2.3.3, so the tool name and version are visible without having to browse into it. Then I moved it under /opt/biouno, where I keep the executables used by the cluster. Finally, I created the symlink /usr/local/bin/structure pointing to /opt/biouno/structure-console-2.3.3/structure.
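
In shell terms, the whole thing is roughly this (a sketch; it assumes the tarball extracts to a folder named console, and the version number will vary with the release you download):

# decompress the console version of Structure
tar xzf structure_linux_console.tar.gz
# rename the extracted folder so the tool and version are visible at a glance
mv console structure-console-2.3.3
# keep cluster executables together under /opt/biouno
sudo mv structure-console-2.3.3 /opt/biouno/
# expose the binary on the $PATH via a symlink
sudo ln -s /opt/biouno/structure-console-2.3.3/structure /usr/local/bin/structure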

You will have to execute the steps above on all the nodes. In a real cluster facility you would probably use an automated deployment system, NFS, or something similar to simplify this (imagine having 30 computers and installing Structure on each one by hand). For our two machines, a simple loop over SSH will do, as in the sketch below.
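
Something like this (a sketch; the node names are hypothetical, and it assumes passwordless SSH as root and that /opt/biouno exists on each node):

# hypothetical node names; adjust to your cluster
for node in node01 node02; do
  # copy the install to the node, then recreate the symlink there
  scp -r /opt/biouno/structure-console-2.3.3 root@$node:/opt/biouno/
  ssh root@$node "ln -s /opt/biouno/structure-console-2.3.3/structure /usr/local/bin/structure"
done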

Structure does not use MPI for parallelism; instead, you have to break your job into smaller pieces. Here's the link to the thread in the Structure Google Groups about it. So first let's run a standalone execution to understand how to run it in a cluster. We will use the Structure example from the Bodega Phylogenetics wiki. Download marm_struct.input; this will be our input file. Now we need to create the mainparams and extraparams files. For the sake of simplicity, you can get these files from this GitHub repository. Put all the files in a directory and, from that directory, execute structure. It should take a few minutes to finish and will print a lot of information to your terminal.
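
With the three files in place, the standalone run boils down to this (a sketch; the flags are standard Structure command-line options that override the file names set in mainparams, and the directory and output names are arbitrary):

cd marm-structure-run   # hypothetical directory holding the three files
structure -m mainparams -e extraparams -i marm_struct.input -o marm_k2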

Installing Torque PBS

There are several tutorials for installing Torque PBS. The one that worked best for us is this one. Following all the steps, and experimenting for a while with different configurations, you will manage to set up your PBS cluster.

Basically, what you need to do is: 1) download Torque PBS from the Adaptive Computing website, 2) compile and install it on your local machine, 3) generate the tarballs for the other machines in your cluster, and 4) set up some configuration files for your environment. This last step is the trickiest one. Here are our mom configuration files for the server and the slave node.

# /var/spool/torque/mom_priv/config @ server
$pbsserver chuva.kinoshita.eti.br
$clienthost chuva.kinoshita.eti.br
$logevent 255
$cputmult 1.0
$wallmult 1.0
$max_load 1.0
$ideal_load 1.0
$restricted *.kinoshita.eti.br

# /var/spool/torque/mom_priv/config @ slave
$pbsserver chuva.kinoshita.eti.br
$logevent 255
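
For reference, steps 2 and 3 boil down to something like this (a sketch following the standard Torque build procedure; the package file name depends on your architecture, and the slave hostname is hypothetical):

# 2) compile and install on the server machine
./configure
make
sudo make install
# 3) generate self-extracting packages for the other machines
make packages
# copy and install the mom package on the compute node
scp torque-package-mom-linux-x86_64.sh slave:/tmp/
ssh slave /tmp/torque-package-mom-linux-x86_64.sh --install
# initialize the server database and a default queue
sudo ./torque.setup root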

Running Structure in parallel

At the beginning of this post, we installed and executed Structure. The mainparams file defined two populations. The number of populations is normally called the K parameter. The Bodega Phylogenetics wiki has some good information about this, and you can find more on the Internet and in scientific papers.

As it is not easy to define the value of K, most users have to run Structure many times and analyse the output data of each run. These runs with different values of K are independent, so they can be executed in parallel. Let's set up a sample run for our PBS cluster that distributes one value of K to each node.

First, copy the current folder and update the mainparams file in one of the copies, changing MAXPOPS from 2 to 1. Both configurations, with K=1 and K=2, must be copied to your NFS partition; then you have different options to split them across your PBS cluster. You can name the folders after the machine names and use the $HOSTNAME environment variable in the shell script you submit to your PBS cluster, or you can code some other mechanism to select the right configuration file. The hostname approach is shown in the sketch below.
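
Here is a minimal job script following that convention (a sketch; the NFS mount point /mnt/nfs/structure and the folder layout are assumptions):

#!/bin/bash
#PBS -N structure-run
#PBS -l nodes=1
# each node enters the folder named after its own short hostname,
# which holds the mainparams/extraparams for its value of K
cd /mnt/nfs/structure/$(hostname -s)
structure

You can then pin one job to each machine, for example qsub -l nodes=chuva structure.pbs for K=2, and the same again with the other node's hostname for K=1.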

One last note: the default scheduler that ships with Torque PBS is very simple; real computing facilities usually pair the batch system with a more capable scheduler such as Moab or Maui.

So far, configuring our small cluster for BioUno has been fun, but there is room for improvement in the cluster management and in the integration of the cluster with Jenkins. We will take a look at the existing plug-ins for cloud computing and, using APIs like pbs4java, we expect to enable Jenkins and BioUno users to use a PBS cluster to build software or to create biology workflows. Stay tuned for more posts on biotechnology and BioUno!