Creating a PBS/MPI cluster for bioinformatics - Part 1

This is a series of posts with the steps we are using to set up an internal cluster for bioinformatics, using MPI and PBS for distributed jobs. Our goal with this series is to help other nerds set up a similar environment. This cluster is not public yet, and we are using it for the development of BioUno. At the moment we are writing code to integrate our Jenkins with MrBayes and Structure.

In this first post, we will see how to configure the network so that each machine can see the others, and how to configure SSH for connecting to the other nodes in the cluster. If you already know MrBayes and Structure, skip the Introduction and go to the basic network configuration topic. We assume you have intermediate knowledge of Linux, as many instructions in this series are vague. We always reference the original tutorial that we used. This series may be edited to stay up to date or for improvements.

Introduction

MrBayes is a tool for Bayesian inference of phylogeny using MCMC (Markov chain Monte Carlo) methods. It is written in C and distributed under the GNU General Public License. In a cluster, you will probably want to run MrBayes from the command line. It also supports MPI, which lets your cluster run MrBayes with more computing power. We will use OpenMPI as the MPI implementation.

Jonathan Pritchard runs the Pritchard Lab, responsible for maintaining Structure and many other useful tools. Structure is used to investigate population structure. It is written in C, with an optional Java front end, and is free, although it is not clear under which license it is distributed.

When running Structure in a cluster environment, you will probably want to run it from the command line too, just like MrBayes. However, it does not use MPI for distributed computing. Instead, we will use a job scheduler to invoke Structure on different computers, each run with a different K parameter value, and then gather all the result files to finish the process. We will use the Torque PBS implementation.
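As a taste of what that will look like, the loop below just prints the command line we would hand to the scheduler for each K value. The file names (mainparams, extraparams, project_data) and the output prefix are hypothetical placeholders; on the real cluster, each echoed line would become a PBS job instead of being printed.

```shell
# Dry run: print one Structure invocation per K value.
# File names and flags below are illustrative placeholders.
for K in 1 2 3 4 5; do
    echo "structure -K $K -m mainparams -e extraparams -i project_data -o results_K${K}"
done
```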

The operating system used is Debian, but the same instructions should work on other Linux distributions, especially those based on Debian, such as Ubuntu and Mint.

Basic network configuration

Your computers must be able to talk to each other, which means that at least ping must work in both directions. In our cluster, all the nodes are connected to a simple hub, configured within the 192.168.0.0/24 network. Each computer has its own static IP address. We have an old D-Link DL-524 router with Wi-Fi and cable ports; it is our gateway, DHCP server, and DNS server.

If you need help configuring the network addresses, try this tutorial.

If your interfaces are ready, with addresses configured, and you are able to ping each machine by IP, it is time to set up the names. Edit the /etc/hosts file and include the IP address and alias of each machine. In our cluster we also include an alias with the domain name, which can be useful with Torque PBS later.

# /etc/hosts
127.0.0.1    localhost
192.168.0.50 master master.tupilabs.com
192.168.0.51 node01 node01.tupilabs.com

# /etc/hosts.equiv
master.tupilabs.com
node01.tupilabs.com

# /etc/resolv.conf
nameserver 192.168.0.1

Don’t forget to update your hostname and the name server address too.

master hostname

# /etc/hostname
master.tupilabs.com

node01 hostname

# /etc/hostname
node01.tupilabs.com

If everything worked so far, you should be able to ping each machine by its name, with or without the domain name. If you need help configuring DNS on your Linux, take a look at this tutorial. Furthermore, if you are using a notebook with Wi-Fi, you may have two interfaces in your Linux, and sometimes the route table is not configured correctly for your network. Try running route -n and have a look at its output to make sure it is all right too.
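A quick way to sanity-check the name setup is shown below, sketched against localhost so it runs anywhere; on the cluster, you would substitute node01 and node01.tupilabs.com.

```shell
# getent consults /etc/hosts (and DNS), so this confirms the entry is picked up
getent hosts localhost
# On the cluster, also try:
#   ping -c 1 node01
#   ping -c 1 node01.tupilabs.com
#   route -n    # inspect the routing table ('ip route' on newer systems)
```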

SSH

If you use GitHub or Bitbucket, you probably already own some SSH keys. Otherwise, you will need to create an SSH key pair. We will use SSH for administering our nodes and for executing jobs later.

ssh-keygen -t rsa

Run the command above to create an RSA key pair. When asked about a passphrase, leave it blank. If everything worked fine, your keys will be created in $HOME/.ssh. Copy the contents of id_rsa.pub to authorized_keys, and then copy all the files within .ssh to the node01 machine (under the remote user home directory too).
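Put together, the key setup looks roughly like this. It is shown against a scratch directory so it can be run safely; for the real setup, use $HOME/.ssh and drop the -f and -N flags to get the interactive prompts. The user and node01 names below are just the examples from this post.

```shell
# Demo in a scratch directory; for the real thing use $HOME/.ssh.
rm -rf /tmp/demo-ssh && mkdir -p /tmp/demo-ssh
ssh-keygen -q -t rsa -N "" -f /tmp/demo-ssh/id_rsa   # empty passphrase
cat /tmp/demo-ssh/id_rsa.pub >> /tmp/demo-ssh/authorized_keys
chmod 600 /tmp/demo-ssh/authorized_keys
# Then copy the whole directory to the slave:
#   scp -r ~/.ssh user@node01:~/
```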

To check whether your SSH setup is fine, try running ssh user@host.domain.com, where user is the user holding the .ssh directory and host.domain.com is the hostname of the other machine. You should see a terminal on the remote machine. For troubleshooting your SSH key pair, use the GitHub documentation here. It is a good chance to create your first repository there too, if you haven't. We will use one later to store our biology data.

NFS

NFS, or Network File System, is a service that, once installed, can keep directories synchronized between an NFS server and an NFS client. There are some tutorials that guide you through the steps to use your NFS synced folders to install tools like MrBayes and Structure. In our cluster here at TupiLabs, we won't use this approach, especially because our cluster is not homogeneous (yup, different hardware) and because we may need to test different versions on each node for development purposes. With NFS that would get trickier. But maybe you will find that approach useful.

We will use NFS to retrieve the result files from all the slaves (one in our example) into a single directory on the master. It saves some time, although the same could be done with a simple script and scp. But we are lazy.
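For the record, the NFS-free alternative could be sketched like this. The node list and paths are the hypothetical examples from this post, and -o BatchMode=yes makes scp fail instead of prompting for a password.

```shell
#!/bin/sh
# Gather result files from each slave with scp instead of NFS.
NODES="node01"              # space-separated list of slaves
SRC="/export/biouno"        # where the results live on each slave
DEST="$HOME/results"        # collection point on the master

for node in $NODES; do
    mkdir -p "$DEST/$node"
    scp -o BatchMode=yes -r "$node:$SRC/." "$DEST/$node/" \
        || echo "failed to fetch from $node"
done
```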

We won't cover all the details of installing NFS on Debian, as you may run into trouble setting it up, and we would only be duplicating information available elsewhere (which is updated there and might take longer to be updated here), which could be frustrating for you. Instead, here is the tutorial that we used: https://help.ubuntu.com/community/NFSv4Howto. A quick note, in case you find other tutorials telling you to use portmap: there is a problem with its dependencies in the Debian repositories, and rpcbind is supposed to replace it, so you don't need it anyway.

Here is our existing configuration too, as it might help you troubleshoot your setup. On the master, we created /var/lib/biouno (to hold our data files) and /export/biouno, with an fstab bind mount from /var/lib/biouno to /export/biouno. On the slaves we created /export/biouno.

NFS master files

# /etc/exports
/export 192.168.0.0/255.255.255.0(rw,no_root_squash,no_subtree_check,crossmnt,fsid=0)

# /etc/fstab
...
/var/lib/biouno /export/biouno none bind 0 0

NFS slave files

# /etc/fstab
192.168.0.50:/ /export nfs rw,nodev,nosuid,noauto 0 0

# /etc/rc.local
mount 192.168.0.50:/

There are some bug reports about Debian with kernel 3.x freezing with NFS. It happened a few times here; our server runs a 3.x Linux kernel, while the slave runs 2.x. This is not crucial for BioUno right now: the main objective is to write the code for MrBayes and Structure in Jenkins, using PBS and MPI.

In the next post we will see how to set up MPI and MrBayes. Stay tuned!