Running Stacks denovo pipeline on a supercomputer (no MySQL) and loading the results into an existing database

In this post you will see a short description of one method to run the denovo pipeline with Stacks on a supercomputer that has no MySQL database, and how to load the results later in another computer with the database installed.

Stacks is a software pipeline for building loci out of a set of short-read sequenced samples. Stacks was developed for the purpose of building genetic maps from RAD-Tag Illumina sequence data, but can also be readily applied to population studies, and phylogeography.

In order to run Stacks in parallel you need OpenMP. What also means that if you need computer power, probably you will want a good amount of memory and many CPU cores. A common case is running Stacks in supercomputers, where you can get nodes with 8, 16 or more CPU’s.

However, it is not normal to have databases in these servers. Usually the job scheduling is controlled via a batch server (like PBS) or some other computer facility management tool. Having a database in the same server, the CPU’s would have to handle hundreds or thousands of connections, what could interfere with the job scheduling

The denovo analysis in Stacks is achieved through the denovo_map.pl Perl script. What this script does, basically, is read a bunch of files (samples) and the output of other programs, call other programs (like ustacks, sstacks, genotypes, etc) and load data into a MySQL database.

The data is load incrementally as the programs are executed. You can disable the MySQL interaction by using the -S flag. But probably you’ll want to load it later. Stacks comes with a very useful web application (written in PHP + MDB2). This web application scans your MySQL server looking for databases like %radtags, and produces a nice interface for mining data. You can sort, browse and filter your sequences, look at SNPs and other useful data.

There is an easy way to load your data into a MySQL database after your analysis finished in the supercomputer (and you have used the -S flag). You can use load_radtags.pl. Here’s a short summary of the actions you would have to follow.

  • Download your samples directory, and the directory where the program saved tags, snps, matches and a lot of other tsv files (including denovo_map.log).
  • Execute load_radtags.pl using the right parameters

You can also use the -d option (dry run). This options prints what is going to be executed, but doesn’t run your analysis. It’s good to use this option when you are not 100% sure about your command. Below is an example of how to use load_radtags.pl. It also includes a call to index_radtags.pl, as you have to call it in order to be able to see your data in the web interface.

$ load_radtags.pl -D pe_radtags -p stacks -b 1 -c -B -e "PE populations RAD" -t population $ index_radtags.pl -D pe_radtags -c -t

If you used the right parameters, and didn’t see errors in the console, your data should be available in the web interface now. Easy, no? :-)