TupiLabs

Work on BioJava and BioSQL integration

Mar 26, 2013 in bioinformatics, ideas | blog

BioJava code has two versions, the current one (version 3) and the legacy code. One of this month’s messages to the BioJava development mailing list says:

“Unfortunately, biojava 3 does not have any support for biosql at this stage. If you want to use that, you will have to use the biojava 1.x series…”

BioJava is extremely useful for Java projects, researchers, libraries and tools, and the same can be said about BioSQL. Integrating these two tools would bring many benefits to the whole community (private companies, researchers, institutions, universities, etc.).

Add stemming to Cogroo

Mar 19, 2013 in cogroo, ideas, nlp | blog

Cogroo is used by OpenOffice’s spell and grammar checker, as well as in some other solutions. At the moment it doesn’t support stemming (see today’s mailing list posts about this), but other tools like nltk do. There’s also a nice Wiki page about it.

Exporting data in R format

Feb 09, 2013 in jenkins, r, ideas | blog

Quandl now has a feature that lets users export data as an R matrix, so the data can be easily loaded into R. Damn awesome, right? Kudos to the guys from Quandl!

read.csv('http://www.quandl.com/api/v1/datasets/OFDP/ALUMINIUM_21.csv?&utf8=✓&trim_start=2012-01-03&trim_end=2013-02-07&sort_order=desc', colClasses=c('Date'='Date'))

Today’s idea is to use Quandl’s approach in other applications. Jenkins exports some of its objects through its external JSON API, and there are stats being collected from plug-ins (like the download statistics). I am curious to know whether there would be any gain in exporting parts of this data as R matrices and processing it directly with R.

This could be used in a lot of different applications, perhaps even using Hadoop and R. Instead of writing collectors to consume a RESTful service, you would simply write one line of R to get the job done.
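To make the exporting side a bit more concrete, here is a minimal sketch in Python of flattening a JSON document into CSV text that R's read.csv could load. It is only an illustration: the function name json_to_csv, the sample payload and the field names ("jobs", "name", "color") are assumptions made for this example, not a description of what Jenkins or Quandl actually ship.

```python
import csv
import io
import json


def json_to_csv(payload, list_key, fields):
    """Flatten payload[list_key] (a list of JSON objects) into CSV text.

    Hypothetical helper for illustration; missing fields become empty cells.
    """
    data = json.loads(payload)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in data.get(list_key, []):
        # Keep only the requested columns, in the requested order.
        writer.writerow({field: item.get(field, "") for field in fields})
    return out.getvalue()


# Assumed sample payload, loosely shaped like a list of jobs.
sample = (
    '{"jobs": ['
    '{"name": "build-core", "color": "blue"}, '
    '{"name": "build-docs", "color": "red"}]}'
)

print(json_to_csv(sample, "jobs", ["name", "color"]))
```

If an application served text like this over HTTP, the R side would stay a single read.csv call on the URL, just like the Quandl example above.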

R does very well at plotting too, so I wouldn’t be surprised to at least see some prettier charts :o)

Study if it's doable to parallelize PacBioToCA

Feb 05, 2013 in bioinformatics, ideas | blog

Homolog.us is an amazing site. The content is clear, rich and very entertaining (at least for those who like biology and technology). They have an active Twitter account, they answer comments, they are on reddit too. These guys rock!

Recently they wrote about PacBioToCA, pointing out its lengthy execution. Parallel execution with OpenMPI, OpenMP or another method such as MapReduce is common in biotechnology.

One idea would be to review the code, check whether some kind of parallelism is doable, and write about it, perhaps even rewriting it. They already wrote about another tool that competes with PacBioToCA. I told you that Homolog.us is amazing.

Create a tool to find out the average test coverage in Jenkins plug-ins

Nov 24, 2012 in jenkins, testing, ideas | blog

This idea popped up while chatting with Richard Lavoie. I was telling him about Selenium tests for plug-ins, one thing led to another, and then came the idea of measuring the coverage in plug-ins.

Basically, this tool would have to iterate over the 600+ plug-ins, and trigger a mvn test command. Then it would have to save the coverage information somewhere, and plot a graph, or generate some report about it.

Depending on the machine, this tool could probably spawn more than one thread to do this. Even then, it would still take quite a while to process all the plug-ins.
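A rough sketch of the threaded part, in Python, could look like the block below. It assumes each plug-in is already checked out in its own directory and that mvn is on the PATH; the names run_tests and run_all are made up for this example. It only records exit codes: collecting real coverage numbers would also need a coverage plug-in configured in each build, which is not shown here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_tests(plugin_dir, command=("mvn", "test")):
    """Run the test command inside one plug-in checkout.

    Returns (plug-in name, exit code); the exit code is None when the
    command could not be started (e.g. mvn or the directory is missing).
    """
    name = Path(plugin_dir).name
    try:
        result = subprocess.run(
            command,
            cwd=plugin_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return name, None
    return name, result.returncode


def run_all(plugin_dirs, workers=4, runner=run_tests):
    """Run the runner over all plug-in directories with a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order; each item is a (name, code) pair.
        return dict(pool.map(runner, plugin_dirs))


# Usage (assumes a local "plugins" directory with one checkout per plug-in):
# results = run_all(sorted(p for p in Path("plugins").iterdir() if p.is_dir()))
```

Threads are enough here because the heavy work happens in the external Maven processes, so the number of workers mostly limits how many builds run at once.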

