Creating a PBS/MPI cluster for bioinformatics – Part 2

In the previous post of this series we saw how to configure a basic network for our small cluster. Now it is time to work on the MPI side of things. Our cluster will be a Beowulf cluster: a cluster built from commodity computers, connected over a local network and sharing resources and programs. In the next post, we will add a batch queuing system to control resource utilization in our cluster.

MPI is not a library; it is a standard. When you find a RESTful application, REST is only a standard, and there are different libraries that implement it (Jersey, JBoss RESTEasy, CodeIgniter + REST controllers, and so on). MPI is no different: there are several implementations of the MPI specification. We suggest you read this tutorial which, although written in 2009, uses Debian (the operating system we are using in the BioUno cluster) and is concise and well written. Basically, you have to install OpenMPI, one of the existing MPI implementations. And if you followed the instructions in part 1 of this series, you already have SSH correctly configured on your machines.
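On Debian, OpenMPI can be installed straight from the package repositories. A minimal sketch (package names may vary between releases, so check your distribution):

# on the master and on every slave node
sudo apt-get update
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev
# sanity check: should print the OpenMPI version
mpirun --version

With OpenMPI installed on every node, create an MPI host file listing each machine and how many slots (processes) it may run, for example: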

# /etc/mpi_hostfile
# Here, the master is a quad-core; run 'grep processor /proc/cpuinfo' to see your CPU info
localhost slots=6

# Slaves
# che is a dual-core
che slots=2
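To check that MPI can reach every node in the host file, a quick smoke test is to run an ordinary command such as hostname across all 8 slots (6 on the master plus 2 on che):

mpirun -np 8 --hostfile /etc/mpi_hostfile hostname

You should see the master's hostname printed six times and che printed twice; if che is missing, revisit the SSH setup from part 1.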

Installing MrBayes with MPI support and Beagle Lib

Now that you have MPI installed and have created an MPI host file, as seen above, let's install MrBayes. But first, Beagle. Beagle is a "high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages". Basically, Beagle provides highly optimized code for the calculations used by MrBayes and other tools, and can also use GPUs. Unfortunately our cluster has only cheap graphics cards, but in the future we will try to extend this series with a comparison using GPUs.

Follow the instructions from the Beagle website to install it on your Linux machine. Don't forget to check that the build found your Java installation or your GPU (in case you are using one), and that the shared libraries are correctly installed, before going to the next step.
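For reference, building Beagle from source usually looks something like the following; the exact steps and install prefix here are assumptions, so defer to the Beagle documentation if they differ:

# from the beagle-lib source directory
./autogen.sh
./configure --prefix=/usr/local
make
sudo make install
sudo ldconfig
# the shared library (libhmsbeagle) should now be listed
ldconfig -p | grep hmsbeagle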

Download the MrBayes tar.gz file from its website, decompress it, and go to the src folder. You can simply run autoconf, then ./configure --enable-mpi=yes and make. If everything works out fine, you will have the executable mb ready to run. Move this executable to your NFS-exported directory, so that the slave nodes can access it too.
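Putting those steps together, here is a sketch of the whole build; the archive name assumes MrBayes 3.2.0 (the version used in this post), and the destination is the NFS export used in the commands further below:

tar -xzf mrbayes-3.2.0.tar.gz
cd mrbayes-3.2.0/src
autoconf
./configure --enable-mpi=yes
make
# copy mb to the NFS-exported directory so the slave nodes can see it
cp mb /export/biouno/mrbayes_3.2.0/src/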

Running your first analysis with MrBayes

Now you are almost done. Download our example Nexus file, prepared to run MrBayes in batch mode, from https://github.com/tupilabs/biology-data/blob/master/mrbayes/primates.nex and then execute it with mb as shown below.
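To fetch the file from the command line into the NFS-exported path used below, something like the following should work (the raw URL is inferred from the GitHub blob URL above, so adjust if needed):

wget -P /export/biouno/biology-data/mrbayes/ https://raw.githubusercontent.com/tupilabs/biology-data/master/mrbayes/primates.nex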

/export/biouno/mrbayes_3.2.0/src/mb /export/biouno/biology-data/mrbayes/primates.nex

You should see the MrBayes analysis output, as well as primates.nex.p and primates.nex.t files, containing the sampled parameter values and the sampled phylogenetic trees, respectively. Now let's try running MrBayes in parallel.

mpirun -np 3 --hostfile /etc/mpi_hostfile /export/biouno/mrbayes_3.2.0/src/mb /export/biouno/biology-data/mrbayes/primates.nex < /dev/null > log.txt

In the example above, MPI executes MrBayes only on the master, as we requested 3 processes and the master alone (a quad-core) has enough slots. You can try increasing the number of processes, but you will have to increase the number of chains in the Nexus file as well. The focus of BioUno is the Java interface that calls MrBayes and controls the cluster from within Jenkins, so benchmarking is not discussed here. For what it is worth, though, the analysis took approximately 3.5 seconds on 1 processor and less than 1 second using 3 processors.
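A note on that last point: MrBayes distributes its Metropolis-coupled chains across MPI processes, so the number of usable processes is bounded by nruns × nchains in the mcmc command. Purely as an illustration (these settings are assumptions, not the ones in the example file), the MrBayes batch block inside a Nexus file looks roughly like this:

begin mrbayes;
    set autoclose=yes nowarn=yes;
    mcmc nruns=2 nchains=4 ngen=10000 samplefreq=10;
end;

With 2 runs of 4 chains each there are 8 chains in total, so up to 8 MPI processes can do useful work.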

In the next part we will configure the job queuing system to run Structure. Stay tuned.