Posts tagged with bioinformatics

Enhance MySQL management in Stacks (University of Oregon)

May 03, 2013 in ideas, bioinformatics | blog


Stacks processes RAD DNA sequences and produces output that is displayed on the web by a bundled PHP + MySQL application. This application is quite useful, especially since it helps developers analyze large amounts of data.


One limitation, though, is that some management tasks may require running SQL statements manually against the database. Someone could add a few scripts or new features to the web interface (a management section, perhaps). This could help researchers in their work.

Work on BioJava and BioSQL integration

Mar 26, 2013 in bioinformatics, ideas | blog

The BioJava code has two versions: the current one (version 3) and the legacy code. From one of this month's messages to the BioJava development mailing list:

“Unfortunately, biojava 3 does not have any support for biosql at this stage. If you want to use that, you will have to use the biojava 1.x series…”

BioJava is extremely useful for Java projects, researchers, libraries and tools. The same can be said of BioSQL. Integrating the two would bring many benefits to the whole community (private companies, researchers, institutions, universities, etc.).

Tupilabs Report: Feb 10, Feb 16

Feb 17, 2013 in bioinformatics, biotechnology, biouno, jenkins-en, speak-like-a-brazilian, tupilabs-report | news

Here’s the list of the cool things we did since last Sunday at TupiLabs.

We are working for you

Have a great week! :D

Study if it's doable to parallelize PacBioToCA

Feb 05, 2013 in bioinformatics, ideas | blog

is an amazing site. The content is clear, rich and very entertaining (at least for those who like biology and technology). They have an active Twitter account, they answer comments, and they are on reddit too. These guys rock!

Recently they wrote about PacBioToCA, pointing out its lengthy execution time. Parallel execution with Open MPI, OpenMP or another method such as MapReduce is common in biotechnology.

One idea would be to review the code, check whether some kind of parallelism is feasible, and write about it, perhaps even rewriting it. They already wrote about another tool that competes with PacBioToCA. I told you that is amazing.

Running Stacks denovo pipeline on a supercomputer (no MySQL) and loading the results into an existing database

Nov 06, 2012 in bioinformatics | tutorials

In this post you will see a short description of one way to run the Stacks denovo pipeline on a supercomputer that has no MySQL database, and how to load the results later on another computer that has the database installed.

Stacks is a software pipeline for building loci out of a set of short-read sequenced samples. Stacks was developed for the purpose of building genetic maps from RAD-Tag Illumina sequence data, but can also be readily applied to population studies, and phylogeography.

To run Stacks in parallel you need OpenMP. This also means that if you need computing power, you will probably want a good amount of memory and many CPU cores. A common case is running Stacks on supercomputers, where you can get nodes with 8, 16 or more CPUs.

However, it is unusual to have databases on these servers. Usually job scheduling is controlled via a batch server (like PBS) or some other facility management tool. With a database on the same server, the CPUs would have to handle hundreds or thousands of connections, which could interfere with the job scheduling.

The denovo analysis in Stacks is achieved through the Perl script. What this script does, basically, is read a bunch of files (samples) and the output of other programs, call other programs (like ustacks, sstacks, genotypes, etc) and load data into a MySQL database.

The data is loaded incrementally as the programs are executed. You can disable the MySQL interaction with the -S flag, but you will probably want to load the data later. Stacks comes with a very useful web application (written in PHP + MDB2). This web application scans your MySQL server looking for databases like %radtags, and produces a nice interface for mining data. You can sort, browse and filter your sequences, look at SNPs and other useful data.
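To make the %radtags matching concrete, here is a minimal Python sketch of how a SQL LIKE pattern selects database names. It is only an illustration, not the web application's actual code (which is PHP); the helper names `like_to_regex` and `matching_databases` and the sample database names are made up for this example.

```python
import re

def like_to_regex(pattern):
    """Translate SQL LIKE wildcards into a regular expression.

    % matches any run of characters, _ matches exactly one character.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def matching_databases(names, pattern="%radtags"):
    """Return the names that match the whole LIKE pattern."""
    rx = like_to_regex(pattern)
    return [name for name in names if rx.fullmatch(name)]

# Hypothetical list of databases on a server.
names = ["pe_radtags", "mysql", "test_radtags", "radtags_old"]
print(matching_databases(names))
```

With a pattern like %radtags, only names that end in "radtags" are picked up, so it is worth naming your database accordingly (as pe_radtags does in the example at the end of this post).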

There is an easy way to load your data into a MySQL database after your analysis has finished on the supercomputer (and you have used the -S flag). Here's a short summary of the steps you would have to follow.

  • Download your samples directory, and the directory where the program saved tags, snps, matches and a lot of other tsv files (including denovo_map.log).
  • Execute using the right parameters
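Before loading, it can help to check that the downloaded directory really contains the files you expect. The following is a rough Python sketch, assuming the output files end in `.tags.tsv`, `.snps.tsv` and `.matches.tsv`; the suffix list and the function names are assumptions made for this illustration, so adjust them to what your run actually produced.

```python
from collections import defaultdict
from pathlib import Path

# Assumed file kinds, taken from the list above (tags, snps, matches).
KINDS = ("tags", "snps", "matches")

def summarize_outputs(directory):
    """Group the .tsv files found in `directory` by kind."""
    found = defaultdict(list)
    for path in sorted(Path(directory).glob("*.tsv")):
        for kind in KINDS:
            if path.name.endswith(f".{kind}.tsv"):
                found[kind].append(path.name)
    return dict(found)

def missing_kinds(directory):
    """Return the kinds for which no file was found."""
    found = summarize_outputs(directory)
    return [kind for kind in KINDS if kind not in found]

if __name__ == "__main__":
    # Check the current directory; replace "." with your output directory.
    print(summarize_outputs("."))
    print("missing:", missing_kinds("."))
```

If `missing_kinds` returns anything, the download was probably incomplete and it is better to fix that before touching the database.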

You can also use the -d option (dry run). This option prints what is going to be executed, but doesn't run anything. It's good to use this option when you are not 100% sure about your command. Below is an example of the loading command. It is followed by a second command, which you have to run in order to see your data in the web interface.

$ -D pe_radtags -p stacks -b 1 -c -B -e "PE populations RAD" -t population
$ -D pe_radtags -c -t

If you used the right parameters, and didn’t see errors in the console, your data should be available in the web interface now. Easy, no? :-)
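If you wrap these steps in your own script, the dry-run idea mentioned above is easy to reproduce. Here is a minimal Python sketch of the general pattern; it is not how the Stacks scripts implement -d, and the `run` helper is a hypothetical name used only for this example.

```python
import shlex
import subprocess

def run(cmd, dry_run=False):
    """Print the command line; execute it only when dry_run is False."""
    line = " ".join(shlex.quote(part) for part in cmd)
    print(line)
    if dry_run:
        # Dry run: show what would be executed and stop here.
        return None
    # check=True raises CalledProcessError if the command fails.
    return subprocess.run(cmd, check=True).returncode

# Nothing is executed here; the command is only printed.
run(["echo", "loading batch 1"], dry_run=True)
```

Passing the command as a list avoids shell quoting problems with arguments that contain spaces, such as the batch description in the example above.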