
Accurity (Luo et al. 2018)

A software tool that infers tumor purity and ploidy from paired tumor-normal whole-genome sequencing data. It differs from other tools by performing well on low-purity and low-coverage samples.
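To illustrate why low purity is the hard case, here is a minimal sketch of the textbook relation between purity, ploidy, and the expected tumor/normal read-depth ratio of a segment. It is not Accurity's actual model; the function name, its parameters, and the assumption that both samples are normalized to the same total coverage are illustrative only.

```python
def expected_depth_ratio(purity, tumor_ploidy, segment_copy_number, normal_ploidy=2):
    """Expected tumor/normal read-depth ratio for one genomic segment.

    Hypothetical helper: assumes read depth is proportional to DNA content
    and that both samples are normalized to the same total coverage.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    # DNA at this segment: tumor cells plus contaminating normal cells.
    segment_dna = purity * segment_copy_number + (1 - purity) * normal_ploidy
    # Genome-wide average DNA content per cell in the mixed tumor sample.
    average_dna = purity * tumor_ploidy + (1 - purity) * normal_ploidy
    return segment_dna / average_dna


# A one-copy gain (3 copies) in a diploid tumor at two purity levels.
high = expected_depth_ratio(purity=0.9, tumor_ploidy=2.0, segment_copy_number=3)
low = expected_depth_ratio(purity=0.1, tumor_ploidy=2.0, segment_copy_number=3)
```

Under these assumptions the same copy-number gain moves the ratio far less at 10% purity than at 90% purity, so the signal is harder to separate from coverage noise.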


We are currently developing algorithms to compute cancer clonal evolution.

Check the Accurity page for more details.


To access the bioinformatics data generated in the Personalized Medicine pilot program (个性化药物先导专项), please log in. It contains more than 100 TB of NGS genomics, transcriptomics, histo-imaging, and proteomics data generated during the drug development pipeline.


PatientStratifier is a software package that stratifies patients based on their biomarker data and drug response data. Its core is a machine learning module that learns from existing patient biomarker and drug response data. It also has a component called PatientRecommender that recommends whether a patient should be given a drug, based on the patient's biomarker data.
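The learn-then-recommend idea can be sketched with a nearest-centroid classifier. This is only an illustration, not PatientStratifier's or PatientRecommender's actual algorithm or API; the function names and the toy data are made up.

```python
from statistics import mean


def fit_centroids(biomarkers, responded):
    """Average biomarker profile of responders and of non-responders."""
    centroids = {}
    for label in (True, False):
        rows = [x for x, y in zip(biomarkers, responded) if y == label]
        if not rows:
            raise ValueError("need at least one patient per response group")
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def recommend(centroids, patient):
    """Return True if the patient is closer to the responder centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return sq_dist(patient, centroids[True]) < sq_dist(patient, centroids[False])


# Toy training data (hypothetical): two biomarker values per patient.
training_biomarkers = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.7]]
training_responded = [True, True, False, False]
model = fit_centroids(training_biomarkers, training_responded)
decision = recommend(model, [0.7, 0.3])  # new patient resembles responders
```

A real module would need feature scaling, validation, and a calibrated confidence before any recommendation is made.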

Contact us for more details.

Parallel workflow to analyze the NGS data (Huang et al. 2015)

A workflow that analyzes ~900 genomes (cumulative coverage ~4000X). Starting from billions of 100bp paired-end reads from the Illumina GenomeAnalyzer II, the whole workflow comprises several sub-workflows: the read-filtering sub-workflow (whose main program is a custom-written Java program based on GATK libraries), the read-alignment sub-workflow (main program is bwa [Li et al. 2009]; stampy [Lunter et al. 2011] was used in testing), the base-quality-score-recalibration sub-workflow by GATK [DePristo et al. 2011], the genotype-calling sub-workflow by SAMtools [Li et al. 2009] and GATK [DePristo et al. 2011], the pedigree-calling sub-workflow (main program is TrioCaller [Chen et al. 2012]), and other sub-workflows that carry out variant filtering and statistics calculation (transition/transversion ratio, allele frequency, Mendelian inconsistency, and population genetic measures such as nucleotide diversity, Hardy-Weinberg equilibrium p-value, and linkage disequilibrium). All workflows interact with the vervet PostgreSQL database through sqlalchemy/elixir. The workflows were constructed in a MapReduce manner using APIs from the Pegasus workflow management system, so the resulting system is flexible and can use the full parallel computing power of most clusters. The main code is available online; substantial Java and C++ code is in private git repositories (contact us if interested).
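Two of the statistics named above can be sketched in a few lines. This is a simplified illustration, not the workflow's own code: the function names are hypothetical, and the Hardy-Weinberg test shown is the plain chi-square test with one degree of freedom, which may differ from the test the workflow actually uses.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Transition/transversion ratio for a list of (ref, alt) base pairs."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) Hardy-Weinberg p-value at a biallelic site."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


# Toy input (made up): three transitions and one transversion.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")])
in_equilibrium = hwe_chi2_pvalue(25, 50, 25)   # genotype counts match expectation
no_heterozygotes = hwe_chi2_pvalue(50, 0, 50)  # strong departure from equilibrium
```

For sites with small genotype counts an exact test is usually preferred over the chi-square approximation.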

This is a job-dependency DAG (directed acyclic graph).
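A job-dependency DAG of this shape can be sketched with the Python standard library. The example does not use the Pegasus API; the job names and the two-sample input are hypothetical, and `graphlib` requires Python 3.9 or later. Per-sample jobs form the map step, and the joint genotype-calling job is the reduce step that waits for all of them.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def build_workflow(samples):
    """Map each (hypothetical) job name to the set of jobs it depends on."""
    dag = {}
    for s in samples:
        # Map step: an independent chain of jobs per sample.
        dag[f"filter_{s}"] = set()
        dag[f"align_{s}"] = {f"filter_{s}"}
        dag[f"recalibrate_{s}"] = {f"align_{s}"}
    # Reduce step: one joint genotype-calling job waits for every sample.
    dag["call_genotypes"] = {f"recalibrate_{s}" for s in samples}
    dag["filter_variants"] = {"call_genotypes"}
    return dag


def execution_waves(dag):
    """Group jobs into waves; jobs in the same wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(build_workflow(["s1", "s2"]))
```

Each wave lists jobs whose dependencies are all finished, which is the information a scheduler needs to dispatch them to cluster nodes concurrently.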

This is the job-duration vs. time diagram from a toy example, showing which jobs take the most time. The real-data workflows involve 100 times as many jobs or more.

Arabidopsis GWAS web app and database (Seren et al. 2012; Huang et al. 2011; Atwell et al. 2010)

The most up-to-date version is hosted on the GMI site; old links are redirected there. The MySQL database dump is available for download. All the code for the version demonstrated in Huang et al. 2011 can be found in this tarball (Pylons web server, web client using Google Web Toolkit, etc.).

A second-generation version is also available (Seren et al. 2012, source code link). The Arabidopsis polymorphism effort has its own site.

github homepage:
