Software that accurately infers tumor purity and ploidy. It distinguishes itself from other tools by performing well on low-purity and low-coverage samples.


Check Accurity for more details. Publication pending.



A software package that stratifies patients based on patient biomarker data and drug response data.


Check this page for more details.


Parallel workflow to analyze NGS data (Huang et al. 2015)

Figure 1.1 The left panel is a job-dependency DAG (directed acyclic graph). The right panel is the job-duration vs. time diagram, illustrating, among other things, which jobs take the most time. Both panels are from a toy example. The real data workflows involve 100 times as many jobs or more.
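The two panels can be related with a small sketch: given a job-dependency DAG and per-job durations, a topological order gives a valid execution order, and the earliest start and finish times show which jobs dominate the total run time. The job names and durations below are hypothetical, and the code uses Python's standard `graphlib` module (Python 3.9+) rather than the Pegasus tooling behind the actual figure.

```python
from graphlib import TopologicalSorter

# Hypothetical toy workflow: each job maps to the set of jobs it depends on.
dependencies = {
    "filter_reads": set(),
    "align_chr1": {"filter_reads"},
    "align_chr2": {"filter_reads"},
    "merge_alignments": {"align_chr1", "align_chr2"},
    "call_genotypes": {"merge_alignments"},
}

# Hypothetical run times in minutes.
durations = {
    "filter_reads": 10,
    "align_chr1": 40,
    "align_chr2": 25,
    "merge_alignments": 5,
    "call_genotypes": 30,
}


def schedule(dependencies, durations):
    """Return (order, start, finish), assuming unlimited parallel workers."""
    # static_order() yields every job after all of its dependencies.
    order = list(TopologicalSorter(dependencies).static_order())
    start, finish = {}, {}
    for job in order:
        # A job can start once its slowest dependency has finished.
        start[job] = max((finish[dep] for dep in dependencies[job]), default=0)
        finish[job] = start[job] + durations[job]
    return order, start, finish


order, start, finish = schedule(dependencies, durations)
for job in order:
    print(f"{job:18s} start={start[job]:3d} finish={finish[job]:3d}")
print("total run time:", max(finish.values()))
```

In this toy example the two alignment jobs run side by side, so the total run time is set by the slower one, not by their sum.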

A workflow that analyzes ~900 monkey genomes (cumulative coverage ~4000). Starting from billions of 100bp paired-end reads from an Illumina GenomeAnalyzer II, the whole workflow comprises several sub-workflows: the read-filtering sub-workflow (the main program is a self-written Java program based on GATK libraries), the read-alignment sub-workflow (the main program is bwa [Li et al. 2009]; stampy [Lunter et al. 2011] was used in testing), the base-quality-score-recalibration sub-workflow by GATK [DePristo et al. 2011], the genotype-calling sub-workflow by SAMtools [Li et al. 2009] and GATK [DePristo et al. 2011], the pedigree-calling sub-workflow (the main program is TrioCaller [Chen et al. 2012]), and other sub-workflows that carry out variant filtering and statistics calculation (transition/transversion, allele frequency, Mendelian inconsistency, and population-genetic measures such as nucleotide diversity, Hardy-Weinberg equilibrium p-value, and linkage disequilibrium).

All workflows interact seamlessly with the vervet PostgreSQL database through SQLAlchemy/Elixir. The workflows were constructed in a MapReduce manner using APIs from the Pegasus workflow management system to take full advantage of the parallel computing power available in most clusters. The end result is a powerful and flexible system that turns thousands of jobs into one single job, and a whole cluster into one single computer.

The main code can be found at . Substantial Java and C++ code is in private git repositories (available upon request).
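The MapReduce-style construction can be illustrated with one of the statistics mentioned above, the transition/transversion (Ts/Tv) ratio. This is a minimal sketch, not the Pegasus-based implementation: it assumes the variants have already been split into chunks (for example, one per chromosome), counts each chunk independently in a map step, and sums the partial counts in a reduce step. All function names and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes.

    Assumes a biallelic SNP with ref != alt.
    """
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def count_chunk(variants):
    """Map step: return (transitions, transversions) for one chunk of SNPs."""
    ts = sum(1 for ref, alt in variants if is_transition(ref, alt))
    return ts, len(variants) - ts


def add_counts(a, b):
    """Reduce step: add two (transitions, transversions) pairs."""
    return a[0] + b[0], a[1] + b[1]


def ts_tv_ratio(chunks, workers=4):
    """Count each chunk in parallel, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    ts, tv = reduce(add_counts, partial_counts, (0, 0))
    return ts / tv if tv else float("nan")


# Hypothetical per-chromosome chunks of (ref, alt) SNP alleles.
chunks = [
    [("A", "G"), ("C", "T"), ("A", "C")],
    [("G", "A"), ("G", "T")],
    [("T", "C"), ("C", "G")],
]
print(ts_tv_ratio(chunks))
```

Because each chunk is counted independently, the same pattern scales from threads on one machine to separate cluster jobs, with the reduce step as the only job that depends on all the others.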

Arabidopsis GWAS web app and database (Seren et al. 2012; Huang et al. 2011; Atwell et al. 2010)

The most up-to-date URL is at (old links from will be redirected to this GMI site). The MySQL database dump can be downloaded from . All the code for the version demonstrated in Huang et al. 2011 can be found in this tarball (Pylons web server, web client using Google Web Toolkit, etc.).

A second-generation version is at (Seren et al. 2012, source code link). The Arabidopsis polymorphism effort is at

GitHub homepage: