Generating Error Profiles

Inferring error profiles using samtools

After mapping the reads on the reference Genome, we can infer various statistics as e.g., number of succesful aligned reads and bases, or number of mismatches and indels, and so on. For this you could easily use the tool collection samtools, which offers a range of simple CLI modules all operating on mapping output (SAM and BAM format). We will use the stats module now:

Usage: samtools stats [OPTIONS] file.bam
       samtools stats [OPTIONS] file.bam chr:from-to
Options:
    -c, --coverage <int>,<int>,<int>    Coverage distribution min,max,step [1,1000,1]
    -d, --remove-dups                   Exclude from statistics reads marked as duplicates
    -X, --customized-index-file         Use a customized index file
    -f, --required-flag  <str|int>      Required flag, 0 for unset. See also `samtools flags` [0]
    -F, --filtering-flag <str|int>      Filtering flag, 0 for unset. See also `samtools flags` [0]
        --GC-depth <float>              the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]
    -h, --help                          This help message
    -i, --insert-size <int>             Maximum insert size [8000]
    -I, --id <string>                   Include only listed read group or sample name
    -l, --read-length <int>             Include in the statistics only reads with the given read length [-1]
    -m, --most-inserts <float>          Report only the main part of inserts [0.99]
    -P, --split-prefix <str>            Path or string prefix for filepaths output by -S (default is input filename)
    -q, --trim-quality <int>            The BWA trimming parameter [0]
    -r, --ref-seq <file>                Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
    -s, --sam                           Ignored (input format is auto-detected).
    -S, --split <tag>                   Also write statistics to separate files split by tagged field.
    -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
    -x, --sparse                        Suppress outputting IS rows where there are no insertions.
    -p, --remove-overlaps               Remove overlaps of paired-end reads from coverage and base count computations.
    -g, --cov-threshold <int>           Only bases with coverage above this value will be included in the target percentage computation [0]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

You can use:

-@ 14

to use more than one thread, but this shouldn’t be necessary for the size of our dataset.

Forward the output of samtools stats into a file called:

~/workdir/mappings/basecall_tiny_porechopped_vs_wuhan.stats

Inspect the results using less or more. If you are stuck - get help on the next page.