Read mapping to a reference (2)¶

Mapping the Nanopore data with Minimap2¶

We use:

minimap2

to compute the mapping and add the following parameters:

What?	parameter	Our value
The reference file	positional (1)	~/workdir/wuhan.fasta
The input file	positional (2)	~/workdir/data_artic/basecall_tiny_porechopped.fastq.gz
The output file	-o	~/workdir/mappings/basecall_tiny_porechopped_vs_wuhan.sam
The number of threads to be used	-t	14
The preset for Nanopore reads	-x	map-ont
Get sam output	-a

Create the mapping folder first:

mkdir ~/workdir/mappings/

The complete commandline for minimap2 is:

minimap2 -t 14 -x map-ont -a -o ~/workdir/mappings/basecall_tiny_porechopped_vs_wuhan.sam ~/workdir/wuhan.fasta ~/workdir/data_artic/basecall_tiny_porechopped.fastq.gz

Mapping the Illumina data with bwa¶

First, we use:

bwa index ~/workdir/wuhan.fasta

to create an index on the reference.

Then, we run the actual mapping with:

bwa mem

and the following parameters:

What?	parameter	Our value
The reference file	positional (1)	~/workdir/wuhan.fasta
The input file(s)	positional (2)	~/workdir/Illumina/D0003_S3_L001_R1_001.fastq.gz and ~/workdir/illumina/D0003_S3_L001_R2_001.fastq.gz
The output file	redirect with >	~/workdir/mappings/illumina_vs_wuhan.sam
The number of threads to be used	-t	14

The complete commandline for bwa mem is:

bwa mem -t 14  ~/workdir/wuhan.fasta  ~/workdir/Illumina/D0003_S3_L001_R1_001.fastq.gz ~/workdir/Illumina/D0003_S3_L001_R2_001.fastq.gz > ~/workdir/mappings/illumina_vs_wuhan.sam

Converting sam files to sorted and indexed bam files¶

In order to read the mapping files into a genome browser, we need to convert the .sam files to sorted .bam files. This is done, using 3 different tools from samtools, the first converts sam to bam:

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -X       include customized index file
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -d STR:STR
           only include reads with tag STR and associated value STR [null]
  -D STR:FILE
           only include reads with tag STR and associated values listed in
           FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all  of the FLAGs in INT present [0]
  -F INT   only include reads with none of the FLAGS in INT present [0]
  -G INT   only EXCLUDE reads with all  of the FLAGs in INT present [0]
  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
           fraction of templates/read pairs to keep; INT part sets seed)
  -M       use the multi-region iterator (increases the speed, removes
           duplicates and outputs the reads as they are ordered in the file)
  -x STR   read tag to strip (repeatable) [null]
  -B       collapse the backward CIGAR operation
  -?       print long help, including note about region specification
  -S       ignored (input format is auto-detected)
  --no-PG  do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --write-index
               Automatically index the output files [off]
      --verbosity INT
               Set level of verbosity

For our purpose, we need the options:

-b for sam to bam conversion

Redirect the output into files with name (or redirect directly to samtools sort - see further below):

~/workdir/mappings/basecall_tiny_porechopped_vs_wuhan.bam
or
~/workdir/mappings/illumina_vs_wuhan.bam

Then sort the bam file with samtools sort:

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -u         Output uncompressed data (equivalent to -l 0)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -M         Use minimiser for clustering unaligned/unplaced reads
  -K INT     Kmer size to use for minimiser [20]
  -n         Sort by read name (not compatible with samtools index command)
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  --no-PG    do not add a PG line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

Your sorted bam files should have the name:

~/workdir/mappings/basecall_tiny_porechopped_vs_wuhan.sorted.bam
or
~/workdir/mappings/illumina_vs_wuhan.sorted.bam

Then index the sorted bam file:

Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]
  -@ INT   Sets the number of threads [none]

If you are stuck - get help on the next page.

References¶

Minimap2 https://github.com/lh3/minimap2

BWA http://bio-bwa.sourceforge.net/

samtools http://www.htslib.org