Software Projects
During my years of research and study, I have been involved, either as a leader or as a contributor, in the following projects:
Current projects
PhyloSift – Phylogenetic analysis of genomes and metagenomes
PhyloSift is a re-implementation and extension of the original AMPHORA approach to reconstructing metagenome phylogeny and taxonomy. PhyloSift has several new capabilities, including the ability to quickly process reads generated by next-generation sequencers, an expanded set of phylogenetic marker genes that includes the Archaea, and Bayesian phylogenetic placement of reads using the excellent pplacer software.
Mauve – multiple genome alignment
Mauve is a software tool for computing whole-genome multiple alignments of bacterial and small eukaryotic genomes (usually no bigger than Drosophila). The software includes a Java-based visualization module and a set of alignment programs written in C++, available for Linux, Mac, and Windows.
Phylogenetic linkage estimation in metagenomes
This project involves a new approach to metagenome analysis relevant for high-throughput sequence data. The goal is to simultaneously estimate which short sequence reads come from the same organism along with their phylogeny, even when the reads do not overlap.
Past projects
libhmsbeagle, also known as the beagle library, is a software development library for calculating phylogenetic likelihoods using a variety of compute hardware types. Currently supported hardware types and features include graphics processing units (GPUs) via CUDA and OpenCL, standard CPUs using SSE for fine-scale parallelism and OpenMP for coarse parallelism, and others. The library works on Linux, Mac, and Windows. Further development on this software is being led by Daniel Ayres.
This software implements a Bayesian model of recombination among bacteria. Starting with whole genome sequences aligned with progressiveMauve, this software can reconstruct the probability distribution over historical recombination events among each lineage. From that probability distribution one can compute quantities of interest, such as the rate of gene flow between particular lineages, which parts of the genome have undergone recombination, and more. We have applied this software to investigate speciation in archaea, the chromosomal structure of recombination in Bacillus, and rates of gene flow between pathogenic and commensal E. coli. This software is being maintained by Dr. Xavier Didelot.
Repeatoire – alignment of interspersed genomic repeats
Software to construct multiple sequence alignments of interspersed repeats directly from raw genomic sequence. This project is led by Dr. Todd Treangen at Johns Hopkins University.
mpiBLAST – open-source parallel BLAST
mpiBLAST is a parallelization of the popular NCBI BLAST for MPI-based compute clusters. When searching large databases, it can yield super-linear speedups. It is extremely flexible, accommodating cluster architectures with and without shared storage and parallel filesystems. It integrates well with most job scheduling systems and has also been extended to grid architectures.
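To make "super-linear" concrete: speedup is the run time on one worker divided by the run time on p workers, and it is super-linear when that ratio exceeds p. The following is a minimal Python sketch of that arithmetic. The function names are hypothetical, they are not part of mpiBLAST, and no measured timings are implied.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of single-worker run time to parallel run time."""
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return serial_seconds / parallel_seconds


def is_superlinear(serial_seconds, parallel_seconds, workers):
    """True when the speedup exceeds the number of workers used."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(serial_seconds, parallel_seconds) > workers
```

For example, a search that takes 100 time units on one worker and 10 on eight workers has a speedup of 10, which is more than the eight workers used and therefore counts as super-linear.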
GenoPlast – Bayesian inference of genomic plasticity
Given a Mauve genome alignment, GenoPlast uses a statistical model to infer the baseline rates of gene gain and loss among a group of organisms, along with lineage-specific changes to those rates. Gain and loss are modeled independently. Thus it is possible to detect, for example, a lineage-specific accelerated loss with a constant rate of acquisition that may be characteristic of a recent lifestyle change in bacteria. This project is led by Dr. Xavier Didelot at the University of Warwick.
barphlye – Bayesian rearrangement phylogeny in Yersinia
barphlye supplements the BADGER software to analyze patterns present in ancient genome arrangements. A modified version of BADGER samples reconstructions of inversion phylogeny among a set of related organisms, and barphlye can then detect bias in the reconstructed ancestral genome arrangements. Using barphlye, I was able to discover rearrangement hotspots near the origin of replication in bacterial chromosomes. I was also able to confirm the bias towards “symmetric” inversion in circular bacterial chromosomes. Although Yersinia is in the title, the software can be applied to any bacteria with circular chromosomes, and can even operate on individual linear chromosomes.
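As an illustration of what a "symmetric" inversion could mean on a circular chromosome, the Python sketch below assumes the definition that the two inversion endpoints lie on opposite sides of the origin of replication at roughly equal distances from it. That definition, the function names, and the tolerance parameter are assumptions made for this example. They are not taken from barphlye or BADGER.

```python
def signed_offset(position, origin, genome_length):
    """Shortest signed distance from the origin to a position on a circle.

    The result lies in the range [-genome_length / 2, genome_length / 2).
    """
    half = genome_length / 2
    return (position - origin + half) % genome_length - half


def is_symmetric_inversion(start, end, origin, genome_length, tolerance):
    """Assumed test: endpoints on opposite sides of the origin, at
    distances from it that differ by no more than `tolerance`."""
    a = signed_offset(start, origin, genome_length)
    b = signed_offset(end, origin, genome_length)
    opposite_sides = a * b < 0
    return opposite_sides and abs(abs(a) - abs(b)) <= tolerance
```

With a chromosome of length 1000 and the origin at position 0, an inversion with endpoints at 900 and 100 is symmetric under this definition, while one with endpoints at 900 and 300 is not when the tolerance is 50.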
Seevolution – a time machine for evolution
Seevolution is an interactive viewer for mutations occurring during genome evolution. The program is written in Java and uses Java3D. This project was developed by Mr. Andres Esteban-Marcos, who was a student I supervised at The University of Queensland.
ZORRO – probabilistic masking for phylogenetics
Despite over 30 years of research, accurate multiple sequence alignment remains a challenge. The number of possible alignments is astronomical, and for any given optimality criterion there are often numerous optimal or nearly optimal alignments. Moreover, alignments are merely a nuisance parameter in the analysis of sequence evolution and its constraints. ZORRO attempts to quantify the uncertainty inherent in a given multiple sequence alignment and to use knowledge of that uncertainty to improve downstream tasks such as phylogenetic inference. This project is led by Dr. Sourav Chatterji at the University of California, Davis.
GRIL – genome inversion and rearrangement locator
A simplistic tool to detect genome rearrangements in single-copy genomic regions among two or more organisms.
ASAP and the ERIC BRC
ASAP is A Systematic Annotation Package for microbial genomes, which is part of the larger Enteropathogen Resource Integration Center (ERIC) project. Together, ERIC and ASAP provide a centralized, web-based means to annotate the genomes of enteropathogens and serve as a clearinghouse for all types of annotation and experimental data. ASAP supports a wealth of automated annotation strategies, and its evolution has been described in a series of Nucleic Acids Research papers. ASAP is led by Associate Professor Nicole T. Perna and has continued to grow at the Genome Evolution Laboratory since my departure.
libClustalW – a C library for Clustal-W 1.83
The 1.83 release of Clustal-W has been refactored as a C library. It builds in Visual Studio on Windows and on Linux, BSD, and other unices with gcc and automake. Note that the Clustal-W authors have since made a 2.0 release which represents a complete rewrite of the aligner and so this library may soon be obsolete.
Extensions to DualBrothers 1.1
Although I was not involved in the original DualBrothers project, I extended the software to apply it to whole-genome alignments and to cases of arbitrary recombination among multiple species. Some of these changes, such as checkpointing, are now available in the newly open-sourced (yay!) DualBrothers Java software. The software is available from .
libGenome – a C++ development library
libGenome is an open-source C++ library for reading and writing genome sequence data in common file formats. It also provides functions for basic manipulation of genome sequences. It was designed from the ground up for speed and efficiency. It is available on SourceForge and also as part of the Debian Linux distribution.