Please use Accucopy!
20191023 Our latest software Accucopy builds upon, improves, and expands Accurity functionality. Please use Accucopy instead.
Accurity is a computational method that infers tumor purity and tumor cell ploidy from tumor-normal WGS (whole exome may work too) data by jointly modelling SCNAs and heterozygous germline single-nucleotide-variants (HGSNVs). Results from both in silico and real sequencing data demonstrated that Accurity is highly accurate and robust, even in low-purity, high-ploidy, and low-coverage (as low as 1X) settings in which several existing methods perform poorly. Accounting for tumor purity and ploidy, Accurity significantly increased the signal/noise gaps between different copy numbers.
- Z. Luo*, X. Fan*, Y. Su, YS. Huang (2018). Accurity: Accurate tumor purity and ploidy inference from tumor-normal WGS data by jointly modelling somatic copy number alterations and heterozygous germline single-nucleotide-variants. Bioinformatics. PDF
The license follows our institute policy that you can use the program for free as long as you are using Accurity strictly for non-profit research purposes. However, if you plan to use Accurity for commercial purposes, a license is required and please contact email@example.com to obtain one.
The full-text of the license is included in the software package.
Get our software
2019/9/10: The registration site is up and running.
2019/8/26: Some critical bug fixes have been pushed into the docker version. Please update! The registration site will be down for a few days for server migration.
2019/3/11: We made available two versions of reference packages for different versions of the human genome (hs38, hs37). Software was also changed a bit. So please pull the latest docker image.
2019/1/11: We replaced Freebayes with Strelka2 to call SNPs. The latter is faster (chromosome-level parallel) and more accurate. Strelka2 is included in the docker image, at /usr/local/strelka. NO need to install it.
Register to receive updates.
Please register here to receive updates. The download link included in the email is a standalone Accurity package without dependencies. If you have trouble installing 3.1, use the docker version instead.
NOTE Because our binary software can be difficult to run (e.g. no root access to install the required libraries, or incompatible libraries), we have made a docker image available at https://hub.docker.com/r/polyactis/accurity, which contains the latest development version of our software and all dependent libraries. Accurity inside the docker image is always ahead of what you can download from this website and of the old source code at https://github.com/polyactis/Accurity.
- Install Ubuntu package "docker.io" before you do anything below.
- Download the refData package from section refData.
- To run it on a HPC cluster, singularity might be a better fit than docker.
Install piece by piece
- A computer with at least 32GB of memory (recommend 64GB).
- Freebayes (A pre-compiled binary for Ubuntu 16.04). A variant caller that is used to call SNPs.
- Pyflow https://github.com/Illumina/pyflow
- samtools (A pre-compiled binary for Ubuntu 16.04)
- libbz2-1.0 (a high-quality block-sorting file compressor library, install it via "apt install libbz2-1.0" in Debian/Ubuntu)
- If your OS (like CentOS) has this library installed but Accurity still fails to load it, you can create a symlink from the installed library file to "libbz2.so.1.0".
- liblzma5 (XZ-format compression library)
- (Only for building from source) pkg-config: used by Rust compiler to find library paths. i.e. "pkg-config --libs --cflags openssl"
- (Optional) R packages ggplot2, grid, scales. Only needed if you obtain a development version of Accurity. Required to make one R plot.
- But the R plot is NOT a must-have, one python plot has similar content as the R plot.
Running Accurity requires a project-specific configure file (details below). Configure it according to your OS environment.
Install pyflow and other Python packages
Other python packages can be installed through Python package system "pip install ..." or Ubuntu package system, dpkg/apt-get.
Register to download the Accurity binary package and receive update emails.
Please register here to receive an email that contains a download link. After finishing download, unpack the package via this:
Accurity is a package that contains a few binary executables and R/Python scripts. All binary executables were compiled for a Linux platform (Ubuntu 14 and 16 tested). It also contains a sample configure file. Denote the full path of the Accurity folder as accurity_path in the configure file (described below).
- If you are having difficulty in getting Accurity to work, please use docker instead.
- This binary package is behind our docker release.
Compile source code (for advanced users)
Instead of downloading binary, you can also choose to compile the source code. Be forewarned, you may run into problems (missing packages, wrong paths, etc.) in compiling the C++ portion on non-Ubuntu platforms. Rust compiling is relatively easy.
Compiling Accurity requires those "lib..." packages in section 3.1 and their corresponding development packages (for example, libbz2-dev). In addition, it requires an installation of Rust, https://www.rust-lang.org/. We have compiled successfully on Ubuntu 16.04 and 18.04.
- The public source code on github (https://github.com/polyactis/Accurity) might be older than the most recent development version. We advise users to use the latest version that is encapsulated in the docker.
The reference genome package
The reference genome package is one of the required inputs of Accurity. The sub-folder, refData/1000g/, contains coordinates of common (allele frequency >10%) SNPs from the 1000 Genomes project. The chromosome coordinates are denoted as "chr1", not "1". We advise users to align reads against the genome file included in the package to re-generate their bam files, in order to minimize wrong alignments and more importantly, match the coordinates of the 1000Genomes SNP file. We provide two downloads for different versions of the human genome.
- hs38d1.7z (714MB, NCBI hs38 is equivalent to UCSC hg38)
- hs37d5.7z (718MB, NCBI hs37 is equivalent to UCSC hg19)
We use the 7z compressor. Run "7z x hs38d1.7z" to extract all files.
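Because the 1000 Genomes SNP coordinates in the package use "chr1"-style names, it can help to confirm that an existing bam file follows the same naming before running Accurity. The sketch below is only an illustration: it parses SAM header text (for example, the output of "samtools view -H tumor.bam"), and the function names are hypothetical helpers, not part of Accurity.

```python
from typing import List


def contig_names(sam_header: str) -> List[str]:
    """Return the sequence names found in the @SQ lines of a SAM header."""
    names = []
    for line in sam_header.splitlines():
        if not line.startswith("@SQ"):
            continue
        # @SQ fields are tab-separated; the sequence name is tagged "SN:".
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def uses_chr_naming(sam_header: str) -> bool:
    """True if chromosome 1 is named "chr1" (not "1") in the header."""
    return "chr1" in contig_names(sam_header)
```

If uses_chr_naming returns False for your bam header, the coordinates will likely not match the SNP file, and re-aligning against the genome file in the package is the safer route.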
Make your own reference genome package
Accucopy (Accurity 2.0) can handle non-human reference genomes. Please check Accucopy.
Input bam files
For example, suppose you have a pair of matched tumor and normal samples.
*.bam.bai files (bam index) are not required. Accurity will call samtools to generate them if they are found to be missing.
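If you prefer to build the index files yourself beforehand, the check can be sketched as follows. This is a minimal illustration, not Accurity's own code: the helper names are hypothetical, and it assumes the index sits next to the bam file with a ".bai" suffix appended, as "samtools index" produces by default.

```python
import os
import subprocess
from typing import List


def index_is_missing(bam_path: str) -> bool:
    """True if <bam_path>.bai does not exist next to the bam file."""
    return not os.path.exists(bam_path + ".bai")


def index_command(samtools_path: str, bam_path: str) -> List[str]:
    """Build the 'samtools index' command line for one bam file."""
    return [samtools_path, "index", bam_path]


def ensure_index(samtools_path: str, bam_path: str) -> None:
    """Run 'samtools index' only when the index file is missing."""
    if index_is_missing(bam_path):
        subprocess.run(index_command(samtools_path, bam_path), check=True)
```

For a matched pair you would call ensure_index once for the tumor bam and once for the normal bam.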
Setup the configure file (latest format, as in the docker version)
Copy the sample configure file (tab-delimited) from the Accurity package into your project folder and modify it accordingly. An example looks like this:
All the fields in the configure file:
read length the length in base pair of the read.
window_size the window size in base pair for segmentation. The segmentation program (GADA) first calculates the number of reads for each window and then performs segmentation over the genome. A small window size often leads to a large number of small segments. The recommended window size is 500bp.
reference_folder_path the path of the genome folder. It contains 3 files (hs.dict, hs.fa, hs.fa.fai) and one folder "1000g". The subdirectory "1000g" contains the 1000 genome bi-allele SNPs file, downloadable from this site.
samtools_path the path of the samtools binary
strelka_path the path of the 3rd-party variant calling program. For freebayes, it's the path to the binary. For Strelka2, it is the path of the folder that contains all Strelka2 code/executables, i.e. /usr/local/strelka.
accurity_path the path of the Accurity software. See section 3.2
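Before launching a run, it may be worth checking that reference_folder_path really contains the entries listed above. The following is a small sketch under that assumption; the function name is a hypothetical helper and only the file and folder names come from the field description.

```python
import os
from typing import List

# Entries the reference folder is described as containing.
REQUIRED_FILES = ("hs.dict", "hs.fa", "hs.fa.fai")
REQUIRED_DIRS = ("1000g",)


def missing_reference_entries(reference_folder_path: str) -> List[str]:
    """Return the required files/folders that are absent from the folder."""
    missing = [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(reference_folder_path, name))
    ]
    missing += [
        name for name in REQUIRED_DIRS
        if not os.path.isdir(os.path.join(reference_folder_path, name))
    ]
    return missing
```

An empty list means the folder layout matches the description; anything else names what still needs to be extracted or fixed.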
Accurity consists of several binary executables. To make everything easy, we have written a Python program main.py ( inside the "Accurity" folder ) which wraps all binary executables in a workflow.
./main.py -h gives you an explanation of all the arguments:
In the debug mode (-d 1), Accurity will produce several intermediate plots, offering insights into how well it is handling the input data.
Accurity contains 7 major components. First, it checks whether the bam index files exist and creates them if they do not. Then, it carries out SNP calling and coverage normalization (in parallel, one job per chromosome). Next, it calls heterozygous SNPs and segments each chromosome (in parallel, one job per chromosome). After that, Accurity infers the purity and ploidy. Last, it plots some results. The whole workflow structure is as follows.
These are the output files that matter.
A summary output that contains purity and ploidy estimates, and some other statistical measures. Probably the most important file to a user.
This contains lots of internal model output, useful for developers.
This contains preliminary copy number alteration predictions.
The important columns are chr, start, end.
"cp" is the predicted copy number.
"copy_no_float" is the raw copy number outputted by our program, which will be converted to an integer (the "cp" column) if our model deems it a clonal (shared by all cancer cells) CNV. Some "cp" will stay as "float" because our model thinks they are subclonal (some cancer cells in one CNV state, some cancer cells in another).
A genome-wide CNV plot.
A period plot. Check if the model fits data well.
A plot for developers.
A clean-data example
All major results are stored in the output directory. File sample_1_infer/infer.out.tsv contains the purity and ploidy estimates. Here is an example. (Viewing in Excel is a lot nicer) :
In the output above, the column 'purity' is the final purity estimate, and 'purity_naive' is the pre-adjusted estimate, which can be ignored. 'logL' is the maximum likelihood of the hierarchical Gaussian Mixture model. 'period' is 1000 times the period of the tumor-read-enrichment (TRE) histogram (333 in this case, i.e. a period of 0.333), which is detected by auto-regression. 'no_of_snps' and 'no_of_segments' are the results of step 3 and step 4, respectively. Other columns are values of commandline arguments.
There are other important output files, such as all_segments.tsv.gz and het_snp.tsv.gz, which are output of step 4 and step 3 respectively. If the sample is abnormal, we can usually see an unreasonable number of segments and SNPs in these two files.
Besides the text output, Accurity will produce some graphic output. One of the most important plots is plot.tre.png, available only in debug mode (-d 1):
TRE stands for Tumor Read Enrichment. You can think of it as a normalized version of the read count ratio between the tumor and normal samples for one chromosome window. More details can be found in our paper. The Y axis in the two panels is the window count. The lower panel is in the log scale. A clean TRE histogram leads to a confident purity estimate.
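To make the idea of a "normalized read count ratio" concrete, here is a deliberately simplified sketch. It is NOT the TRE definition from the paper: it assumes the only normalization is dividing each window count by that sample's total read count, and every name in it is hypothetical.

```python
def normalized_read_ratio(tumor_count: int, normal_count: int,
                          tumor_total: int, normal_total: int) -> float:
    """Tumor/normal read count ratio for one window, scaled by library size.

    Simplifying assumption: each count is normalized only by the total
    number of reads in its own sample.
    """
    if normal_count <= 0 or tumor_total <= 0:
        raise ValueError("normal_count and tumor_total must be positive")
    # Equivalent to (tumor_count / tumor_total) / (normal_count / normal_total).
    return (tumor_count * normal_total) / (normal_count * tumor_total)
```

Under this assumption, a window with 150 tumor reads and 100 normal reads, in samples of equal total size, has a ratio of 1.5. A histogram of such per-window values over the genome is the kind of picture the TRE plot summarizes.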
In this clean-data example, the tumor read enrichment (TRE) histogram displays a beautiful periodic pattern. That means we can confidently infer the period (=0.333) from the TRE data, and the ensuing maximum likelihood estimates will be more robust. The CNV estimates (a by-product of Accurity), in plot.cnv.png, also show a clean copy number variation (CNV) profile.
This is the estimated CNV profile for the example. The top plot is the estimated absolute copy number for each segment. For a normal sample, the absolute copy number should be 2 throughout the genome. The lower plot shows the major allele copy number for each segment.
There are cases where purity and ploidy cannot be inferred:
- The cancer genome contains too few somatic copy number alterations.
- The noise level is too high, or the noise level is moderate but the sample purity is very low (<0.05).
A noisy-data example
Occasionally, a user will encounter extremely noisy data. The user should learn to identify noisy data from the plots and should NOT use the estimates made by Accurity in such cases. In the future, we will probably stop Accurity from making any estimate on such data. But for the time being, here is a noisy-data example.
Content of infer.out.tsv for a noisy-data example. The high number of segments is a red flag.
Its tumor-read-enrichment (TRE) histogram (plot.tre.png) has one big, unclean peak (its landscape looks as if it had been cut by a lousy jigsaw), which makes it really difficult to estimate the period accurately. The period estimate (0.468; the 468 in the 2nd cell of the 4th line is 1000 times the period) is probably far from the truth. All ensuing maximum likelihood estimates are questionable. The estimated CNV profile further confirms the great amount of noise in this data.
"At genomic regions of the first type, all cancer subclones have the same integral copy number....we call these regions clonal."
Why is the copy number integral?
The copy number of a region in a single cell is an integer, like 1, 2, or 3. That sounds obvious, but some regions show fractional copy numbers (2.3, 3.5) because 1) your sequencing data comes from a mixture of thousands or even millions of cells, which is called bulk sequencing (not single-cell sequencing), and 2) a tumor is usually not homogeneous, so some cancer cells differ from others in copy number. For example, if in one region 50% of the cells have copy number 2 and the other 50% have copy number 3, you will see 2.5 overall. These regions are called subclonal.
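The arithmetic behind the 2.5 in the example above is a weighted average over the cell populations. A minimal sketch (the function name is hypothetical and this is only the averaging step, not how Accurity models the data):

```python
from typing import Iterable, Tuple


def average_copy_number(populations: Iterable[Tuple[float, int]]) -> float:
    """Average copy number of a region over a mixture of cell populations.

    populations: (fraction_of_cells, integer_copy_number) pairs whose
    fractions sum to 1.
    """
    populations = list(populations)
    total_fraction = sum(fraction for fraction, _ in populations)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("cell fractions must sum to 1")
    return sum(fraction * copies for fraction, copies in populations)


# 50% of cells at copy number 2, 50% at copy number 3.
print(average_copy_number([(0.5, 2), (0.5, 3)]))  # 2.5
```

When every cell carries the same copy number, the average is that integer itself, which is why clonal regions appear integral.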
"How does Accurity deal with multiple subclones(>2)?"
We don't estimate the number (or fractions) of cancer subclones. Our software tells the user whether a region is clonal or subclonal, and estimates its copy number.
Please note that "subclone" and "subclonal" refer to different things. "Subclone" or "clone" refers to a lineage of cancer cells in the cancer cell evolution process. "Subclonal" or "clonal" refers to mutations. "Subclonal" mutations are the ones that give rise to a cancer subclone on an evolutionary branch, and thus are not shared across all cancer cells in the tumor sample. "Clonal" mutations are the ones that are shared across all cancer cells.
I think lots of people get confused about it.
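In practice, the clonal/subclonal call for a region can be read off the "cp" column described earlier: an integer value means the model treated the region as clonal, a fractional value means subclonal. The sketch below separates the two; the function names, the tolerance, and the example rows are hypothetical, and only the column names chr, start, end and cp come from the output description.

```python
from typing import Dict, List, Tuple


def is_integral(cp: float, tol: float = 1e-6) -> bool:
    """True if a copy number value is (numerically) an integer."""
    return abs(cp - round(cp)) <= tol


def split_by_clonality(segments: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split segment rows into (clonal, subclonal) by their 'cp' value."""
    clonal = [s for s in segments if is_integral(float(s["cp"]))]
    subclonal = [s for s in segments if not is_integral(float(s["cp"]))]
    return clonal, subclonal


# Made-up rows for illustration only.
example = [
    {"chr": "chr1", "start": 1, "end": 1000000, "cp": 2},
    {"chr": "chr1", "start": 1000001, "end": 2000000, "cp": 2.5},
    {"chr": "chr2", "start": 1, "end": 500000, "cp": 3.0},
]
clonal, subclonal = split_by_clonality(example)
print(len(clonal), len(subclonal))  # 2 1
```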