Posts in tutorials

Running Stacks denovo pipeline on a supercomputer (no MySQL) and loading the results into an existing database

Nov 06, 2012 in bioinformatics | tutorials

This post briefly describes one way to run the Stacks denovo pipeline on a supercomputer that has no MySQL database, and how to load the results later on another computer where the database is installed.

Stacks is a software pipeline for building loci out of a set of short-read sequenced samples. Stacks was developed for the purpose of building genetic maps from RAD-Tag Illumina sequence data, but can also be readily applied to population studies, and phylogeography.

To run Stacks in parallel you need OpenMP. It also means that if you need computing power, you will probably want a good amount of memory and many CPU cores. A common case is running Stacks on supercomputers, where you can get nodes with 8, 16 or more CPUs.

However, it is unusual to have databases on these servers. Job scheduling is usually controlled by a batch server (like PBS) or some other facility management tool. With a database on the same server, the CPUs would have to handle hundreds or thousands of connections, which could interfere with the job scheduling.

The denovo analysis in Stacks is achieved through the denovo_map.pl Perl script. What this script does, basically, is read a bunch of files (samples) and the output of other programs, call other programs (like ustacks, sstacks, genotypes, etc) and load data into a MySQL database.

The data is loaded incrementally as the programs are executed. You can disable the MySQL interaction with the -S flag, but you will probably want to load the data later. Stacks comes with a very useful web application (written in PHP + MDB2). This web application scans your MySQL server looking for databases like %radtags, and produces a nice interface for mining data. You can sort, browse and filter your sequences, look at SNPs and other useful data.
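As a minimal sketch of a run with the database disabled, the function below builds a denovo_map.pl command line with -S and one -s entry per sample file. The directory names, the .fq extension, the batch ID and the thread count are assumptions for illustration and not taken from the run described here; check denovo_map.pl's own help for the options in your Stacks version.

```shell
#!/usr/bin/env bash
# Sketch: build a denovo_map.pl command line with the database disabled (-S).
# The samples directory, output directory and thread count are hypothetical.

build_denovo_cmd() {
    local samples_dir=$1 out_dir=$2 threads=$3
    local cmd=(denovo_map.pl -S -b 1 -T "$threads" -o "$out_dir")
    local f
    for f in "$samples_dir"/*.fq; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        cmd+=(-s "$f")            # one -s per sample file
    done
    echo "${cmd[*]}"              # print the command instead of running it
}
```

Inside a job script you could print the command first, for example with `build_denovo_cmd ./samples ./stacks 8`, and only run it once it looks right.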

There is an easy way to load your data into a MySQL database after your analysis has finished on the supercomputer (run with the -S flag): load_radtags.pl. Here’s a short summary of the steps to follow.

  • Download your samples directory, and the directory where the program saved tags, snps, matches and a lot of other tsv files (including denovo_map.log).
  • Execute load_radtags.pl with the right parameters.
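Before loading, it can help to confirm that the download is complete. The sketch below checks a directory for denovo_map.log and for at least one tags, snps and matches file. The file-name patterns are assumptions about how the tsv files are named; adjust them to what your run actually produced.

```shell
#!/usr/bin/env bash
# Sketch: check a downloaded Stacks output directory before loading it.
# The *.tags.tsv / *.snps.tsv / *.matches.tsv patterns are assumptions.

check_stacks_dir() {
    local dir=$1 missing=0 pattern
    if [ ! -f "$dir/denovo_map.log" ]; then
        echo "missing: denovo_map.log"
        missing=1
    fi
    for pattern in '*.tags.tsv' '*.snps.tsv' '*.matches.tsv'; do
        # compgen -G succeeds only if at least one file matches the glob
        if ! compgen -G "$dir/$pattern" > /dev/null; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

The function prints one line per missing item and returns non-zero if anything is absent, so it can guard the load step in a script.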

You can also use the -d option (dry run). This option prints what is going to be executed, but doesn’t actually run it. It’s good to use this option when you are not 100% sure about your command. Below is an example of how to use load_radtags.pl. It also includes a call to index_radtags.pl, which you have to run in order to see your data in the web interface.

$ load_radtags.pl -D pe_radtags -p stacks -b 1 -c -B -e "PE populations RAD" -t population
$ index_radtags.pl -D pe_radtags -c -t
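The two commands can be wrapped in a small function, sketched below, that either does a dry run with -d or loads the data and then indexes it. The function name and its arguments are hypothetical; the flags are the ones used in the example above, and -d is assumed to behave as described earlier.

```shell
#!/usr/bin/env bash
# Sketch: load Stacks results, then index them, with an optional dry run.
# load_and_index and its argument order are hypothetical helpers.

load_and_index() {
    local db=$1 dir=$2 batch=$3 desc=$4 mode=${5:-run}
    local load=(load_radtags.pl -D "$db" -p "$dir" -b "$batch" -c -B -e "$desc" -t population)
    if [ "$mode" = dry ]; then
        "${load[@]}" -d      # dry run: print what would be executed, load nothing
        return
    fi
    # Index only if the load succeeded.
    "${load[@]}" && index_radtags.pl -D "$db" -c -t
}
```

For example, `load_and_index pe_radtags stacks 1 "PE populations RAD" dry` would only print what is going to be executed, and dropping the last argument would perform the load and the indexing.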

If you used the right parameters, and didn’t see errors in the console, your data should be available in the web interface now. Easy, no? :-)

Developing plug-ins for Jenkins: Hosting plug-ins on jenkins-ci.org and releasing a plug-in

Oct 20, 2012 in jenkins | tutorials

The third video about developing plug-ins for Jenkins is now online. This video explains how to host Open Source plug-ins on jenkins-ci.org. At the end of the video, version 1.6 of the Jenkins TAP Plug-in is released. We hope that by showing how simple and easy it is to release a plug-in, more people will feel motivated to contribute their own plug-ins.

Have fun :D

Integrating Nutch 2.x, MySQL and Solr

Sep 15, 2012 in nutch | tutorials

This post is a translation of the post: http://www.kinoshita.eti.br/2012/09/14/integrating-nutch-2-x-mysql-and-solr/

At the moment we are working on a project using Apache Nutch 2.x, Apache Hadoop, Apache Solr 4 and a bunch of other cool tools/modules/APIs/etc. After following the instructions found at http://nlp.solutions.asia/?p=180, I managed to connect Apache Nutch, MySQL and Apache Solr.

[Image: mysql_hadoop_solr_nutch]

In short:

  • Create a database to hold your data
  • Use SQLDataStore and add the configuration for your MySQL server
  • Update the Apache Nutch configuration
  • Update the Solr schema

Now our Apache Nutch uses MySQL as its data store (the place where the results of the crawling process are kept, such as the URL, content, text, metadata, and so on). That's great, but there is one more step missing in the Solr schema provided in the blog post.

Because of bug SOLR-3432, after following the tutorial and replacing the original schema you will not be able to delete the whole index. After following the instructions in the bug's comments and adding the following entry to the schema.xml file, it worked again.

<field name="_version_" type="long" indexed="true" stored="true"/>

Restart Apache Solr and run the following command, and your index will be reset.

curl "http://localhost:8983/solr/collection1/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
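One way to confirm that the index is empty afterwards is to ask Solr for the number of matching documents. The sketch below only extracts the count from a JSON select response read on standard input. The helper name is hypothetical, and it assumes the response carries a "numFound" field, as Solr's JSON response writer normally produces.

```shell
#!/usr/bin/env bash
# Sketch: read the document count out of a Solr JSON select response.
# Assumes the response contains a "numFound":<n> field; num_found is a
# hypothetical helper name.

num_found() {
    grep -o '"numFound":[0-9]*' | head -n 1 | cut -d: -f2
}

# Hypothetical usage against the core used above:
# curl -s 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json' | num_found
```

After a successful reset, the count printed for a match-all query should be zero.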

I hope this helps if you are setting up a similar environment.

See you! -B