WGS Assembly evaluation with QUAST¶

We will evaluate our WGS assembly with the tool QUAST.

QUAST stands for QUality ASsessment Tool. The tool evaluates genome assemblies by computing various metrics. You can find all project news and the latest version of the tool at sourceforge. QUAST utilizes MUMmer, GeneMarkS, GeneMark-ES, GlimmerHMM, and GAGE.

Here is the usage for quast.py:

Usage: python /usr/local/bin/quast.py [options] <files_with_contigs>

Options:
-o  --output-dir  <dirname>       Directory to store all result files [default: quast_results/results_<datetime>]
-r                <filename>      Reference genome file
-g  --features [type:]<filename>  File with genomic feature coordinates in the reference (GFF, BED, NCBI or TXT)
                                  Optional 'type' can be specified for extracting only a specific feature type from GFF
-m  --min-contig  <int>           Lower threshold for contig length [default: 500]
-t  --threads     <int>           Maximum number of threads [default: 25% of CPUs]

Advanced options:
-s  --split-scaffolds                 Split assemblies by continuous fragments of N's and add such "contigs" to the comparison
-l  --labels "label, label, ..."      Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
-L                                    Take assembly names from their parent directory names
-e  --eukaryote                       Genome is eukaryotic (primarily affects gene prediction)
    --fungus                          Genome is fungal (primarily affects gene prediction)
    --large                           Use optimal parameters for evaluation of large genomes
                                      In particular, imposes '-e -m 3000 -i 500 -x 7000' (can be overridden manually)
-k  --k-mer-stats                     Compute k-mer-based quality metrics (recommended for large genomes)
                                      This may significantly increase memory and time consumption on large genomes
    --k-mer-size                      Size of k used in --k-mer-stats [default: 101]
    --circos                          Draw Circos plot
-f  --gene-finding                    Predict genes using GeneMarkS (prokaryotes, default) or GeneMark-ES (eukaryotes, use --eukaryote)
    --mgm                             Use MetaGeneMark for gene prediction (instead of the default finder, see above)
    --glimmer                         Use GlimmerHMM for gene prediction (instead of the default finder, see above)
    --gene-thresholds <int,int,...>   Comma-separated list of threshold lengths of genes to search with Gene Finding module
                                      [default: 0,300,1500,3000]
    --rna-finding                     Predict ribosomal RNA genes using Barrnap
-b  --conserved-genes-finding         Count conserved orthologs using BUSCO (only on Linux)
    --operons  <filename>             File with operon coordinates in the reference (GFF, BED, NCBI or TXT)
    --est-ref-size <int>              Estimated reference size (for computing NGx metrics without a reference)
    --contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]
    --x-for-Nx <int>                  Value of 'x' for Nx, Lx, etc metrics reported in addition to N50, L50, etc (0, 100) [default: 90]
-u  --use-all-alignments              Compute genome fraction, # genes, # operons in QUAST v1.* style.
                                      By default, QUAST filters Minimap's alignments to keep only best ones
-i  --min-alignment <int>             The minimum alignment length [default: 65]
    --min-identity <float>            The minimum alignment identity (80.0, 100.0) [default: 95.0]
-a  --ambiguity-usage <none|one|all>  Use none, one, or all alignments of a contig when all of them
                                      are almost equally good (see --ambiguity-score) [default: one]
    --ambiguity-score <float>         Score S for defining equally good alignments of a single contig. All alignments are sorted
                                      by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are
                                      discarded. S should be between 0.8 and 1.0 [default: 0.99]
    --strict-NA                       Break contigs in any misassembly event when compute NAx and NGAx.
                                      By default, QUAST breaks contigs only by extensive misassemblies (not local ones)
-x  --extensive-mis-size  <int>       Lower threshold for extensive misassembly size. All relocations with inconsistency
                                      less than extensive-mis-size are counted as local misassemblies [default: 1000]
    --scaffold-gap-max-size  <int>    Max allowed scaffold gap length difference. All relocations with inconsistency
                                      less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]
    --unaligned-part-size  <int>      Lower threshold for detecting partially unaligned contigs. Such contig should have
                                      at least one unaligned fragment >= the threshold [default: 500]
    --skip-unaligned-mis-contigs      Do not distinguish contigs with >= 50% unaligned bases as a separate group
                                      By default, QUAST does not count misassemblies in them
    --fragmented                      Reference genome may be fragmented into small pieces (e.g. scaffolded reference)
    --fragmented-max-indent  <int>    Mark translocation as fake if both alignments are located no further than N bases
                                      from the ends of the reference fragments [default: 85]
                                      Requires --fragmented option
    --upper-bound-assembly            Simulate upper bound assembly based on the reference genome and reads
    --upper-bound-min-con  <int>      Minimal number of 'connecting reads' needed for joining upper bound contigs into a scaffold
                                      [default: 2 for mate-pairs and 1 for long reads]
    --est-insert-size  <int>          Use provided insert size in upper bound assembly simulation [default: auto detect from reads or 255]
    --plots-format  <str>             Save plots in specified format [default: pdf].
                                      Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz
    --memory-efficient                Run everything using one thread, separately per each assembly.
                                      This may significantly reduce memory consumption on large genomes
    --space-efficient                 Create only reports and plots files. Aux files including .stdout, .stderr, .coords will not be created.
                                      This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built
-1  --pe1     <filename>              File with forward paired-end reads (in FASTQ format, may be gzipped)
-2  --pe2     <filename>              File with reverse paired-end reads (in FASTQ format, may be gzipped)
    --pe12    <filename>              File with interlaced forward and reverse paired-end reads. (in FASTQ format, may be gzipped)
    --mp1     <filename>              File with forward mate-pair reads (in FASTQ format, may be gzipped)
    --mp2     <filename>              File with reverse mate-pair reads (in FASTQ format, may be gzipped)
    --mp12    <filename>              File with interlaced forward and reverse mate-pair reads (in FASTQ format, may be gzipped)
    --single  <filename>              File with unpaired short reads (in FASTQ format, may be gzipped)
    --pacbio     <filename>           File with PacBio reads (in FASTQ format, may be gzipped)
    --nanopore   <filename>           File with Oxford Nanopore reads (in FASTQ format, may be gzipped)
    --ref-sam <filename>              SAM alignment file obtained by aligning reads to reference genome file
    --ref-bam <filename>              BAM alignment file obtained by aligning reads to reference genome file
    --sam     <filename,filename,...> Comma-separated list of SAM alignment files obtained by aligning reads to assemblies
                                      (use the same order as for files with contigs)
    --bam     <filename,filename,...> Comma-separated list of BAM alignment files obtained by aligning reads to assemblies
                                      (use the same order as for files with contigs)
                                      Reads (or SAM/BAM file) are used for structural variation detection and
                                      coverage histogram building in Icarus
    --sv-bedpe  <filename>            File with structural variations (in BEDPE format)

Speedup options:
    --no-check                        Do not check and correct input fasta files. Use at your own risk (see manual)
    --no-plots                        Do not draw plots
    --no-html                         Do not build html reports and Icarus viewers
    --no-icarus                       Do not build Icarus viewers
    --no-snps                         Do not report SNPs (may significantly reduce memory consumption on large genomes)
    --no-gc                           Do not compute GC% and GC-distribution
    --no-sv                           Do not run structural variation detection (make sense only if reads are specified)
    --no-gzip                         Do not compress large output files
    --no-read-stats                   Do not align reads to assemblies
                                      Reads will be aligned to reference and used for coverage analysis,
                                      upper bound assembly simulation, and structural variation detection.
                                      Use this option if you do not need read statistics for assemblies.
    --fast                            A combination of all speedup options except --no-check

Other:
    --silent                          Do not print detailed information about each step to stdout (log file is not affected)
    --test                            Run QUAST on the data from the test_data folder, output to quast_test_output
    --test-sv                         Run QUAST with structural variants detection on the data from the test_data folder,
                                      output to quast_test_output
-h  --help                            Print full usage message
-v  --version                         Print version

You can start the tool with:

quast.py

To call quast.py we have to provide a reference genome and one or more assemblies. The reference is the wuhan reference. Find the appropriate parameters in the usage and use:

-t <number of threads>

in addition.

The output should be stored in:

~/workdir/assembly/assembly_wgs/quast

When you are done (or stuck) go to the next page.

References¶

quast http://sourceforge.net/projects/quast