Polishing with Racon

Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.

Racon can be used as a polishing tool after assembly, with either Illumina data or third-generation sequencing data. The type of input data is detected automatically.

Racon takes only three input files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format, and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. The output is a set of polished contigs in FASTA format, printed to stdout. All input files can be compressed with gzip, which increases parsing time.

We are going to use racon for an initial correction. The medaka documentation advises running four rounds of racon before polishing with medaka, since medaka was trained on racon-polished assemblies. We are only doing one round here.
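For reference, repeated polishing could be scripted roughly as below. This is a minimal sketch and not part of the workshop steps: the function name polish_rounds, the output directory layout, and the file names round_N.sam and racon_round_N.fasta are assumptions made for illustration. Each round maps the reads to the current draft with minimap2 (the mapping step is described in the next section) and then polishes that draft with racon, using the same options as the single round we run in this tutorial.

```shell
# polish_rounds READS ASSEMBLY OUTDIR [ROUNDS]
# Hypothetical helper: runs ROUNDS (default 4) iterations of
# "map reads to the current draft, then polish the draft with racon".
# Prints the path of the final polished FASTA on success.
polish_rounds() {
    reads=$1
    draft=$2
    outdir=$3
    rounds=${4:-4}

    mkdir -p "$outdir" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        sam="$outdir/round_${i}.sam"            # assumed file name
        next="$outdir/racon_round_${i}.fasta"   # assumed file name

        # Map the reads against the current draft (SAM output via -a).
        minimap2 -a -t 14 "$draft" "$reads" > "$sam" || return 1

        # Polish the current draft; racon writes the result to stdout.
        racon -t 14 -m 8 -x -6 -g -8 -w 500 "$reads" "$sam" "$draft" > "$next" || return 1

        # The polished contigs become the draft for the next round.
        draft=$next
        i=$((i + 1))
    done

    echo "$draft"
}
```

A call might look like `polish_rounds reads.fastq.gz assembly.contigs.fasta racon_rounds 4`, where all three paths are placeholders. The important detail is that the reads are re-mapped in every round, because the contig coordinates change after each polishing step.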

Mapping of Nanopore reads to the assembly

In order to use racon, we need a mapping of the reads to the assembly. We use minimap2 for this task.

Map the data to the assembly:

minimap2 -a -t 14 ~/workdir/assembly/assembly_wgs/assembly.contigs.fasta ~/workdir/data_wgs/Cov2_HK_WGS_small_porechopped.fastq.gz  > ~/workdir/mappings/Cov2_HK_WGS_small_porechopped_vs_assembly_wgs.sam
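Before polishing, it can be worth checking that the mapping actually produced alignments. The following is a small sketch that counts mapped and unmapped records in the SAM file using only awk. The helper name sam_summary is hypothetical, and the count includes secondary and supplementary records, so treat it as a rough sanity check rather than an exact read count.

```shell
# sam_summary FILE
# Hypothetical helper: counts alignment records in a SAM file.
# A record is treated as unmapped when bit 0x4 of the FLAG field (column 2) is set.
sam_summary() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

sam=~/workdir/mappings/Cov2_HK_WGS_small_porechopped_vs_assembly_wgs.sam

if [ -s "$sam" ]; then
    sam_summary "$sam"
else
    echo "SAM file not found or empty: $sam" >&2
fi
```

If almost all records are reported as unmapped, the reads and the assembly probably do not belong together, and polishing will have little to work with.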

Run racon

Check the usage of racon:

racon --help
usage: racon [options ...] <sequences> <overlaps> <target sequences>

  <sequences>
      input file in FASTA/FASTQ format (can be compressed with gzip)
      containing sequences used for correction
  <overlaps>
      input file in MHAP/PAF/SAM format (can be compressed with gzip)
      containing overlaps between sequences and target sequences
  <target sequences>
      input file in FASTA/FASTQ format (can be compressed with gzip)
      containing sequences which will be corrected

  options:
      -u, --include-unpolished
          output unpolished target sequences
      -f, --fragment-correction
          perform fragment correction instead of contig polishing
          (overlaps file should contain dual/self overlaps!)
      -w, --window-length <int>
          default: 500
          size of window on which POA is performed
      -q, --quality-threshold <float>
          default: 10.0
          threshold for average base quality of windows used in POA
      -e, --error-threshold <float>
          default: 0.3
          maximum allowed error rate used for filtering overlaps
      -m, --match <int>
          default: 5
          score for matching bases
      -x, --mismatch <int>
          default: -4
          score for mismatching bases
      -g, --gap <int>
          default: -8
          gap penalty (must be negative)
      -t, --threads <int>
          default: 1
          number of threads
      --version
          prints the version number
      -h, --help
          prints the usage

We need to call:

racon

with our reads, our mapping (in SAM format), and the assembly to be polished (in that order). Racon prints the polished contigs to stdout, so the output has to be redirected to a file. We use 14 threads:

-t 14

And in addition the following parameters:

-m 8 -x -6 -g -8 -w 500

… because these values were also used when training medaka, and we want our draft to have a similar error profile.
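Put together, the call could look like the sketch below. The input paths are the ones used in the mapping step. The output file name assembly.racon.fasta is an assumption chosen for illustration, so adjust it to your own layout. The guard only checks that racon is installed and that the inputs exist; it is there so the snippet fails with a readable message instead of a racon error.

```shell
reads=~/workdir/data_wgs/Cov2_HK_WGS_small_porechopped.fastq.gz
sam=~/workdir/mappings/Cov2_HK_WGS_small_porechopped_vs_assembly_wgs.sam
assembly=~/workdir/assembly/assembly_wgs/assembly.contigs.fasta
# Assumed output location, not prescribed by the tutorial:
polished=~/workdir/assembly/assembly_wgs/assembly.racon.fasta

if command -v racon >/dev/null 2>&1 && [ -s "$reads" ] && [ -s "$sam" ] && [ -s "$assembly" ]; then
    # Argument order: <sequences> <overlaps> <target sequences>
    racon -t 14 -m 8 -x -6 -g -8 -w 500 "$reads" "$sam" "$assembly" > "$polished"
else
    echo "racon or one of the input files is missing; nothing was run" >&2
fi
```

Note that -w 500 and -g -8 are the same as racon's defaults; they are spelled out here only so that the command matches the parameters listed above.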

Racon Problems

If you have trouble running racon and get an “Illegal instruction (core dumped)” message, try reinstalling it with the following commands:

sudo rm /usr/local/bin/racon
cd ~
git clone --recursive https://github.com/lbcb-sci/racon.git racon
cd racon
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
sudo make install
cd
rm -rf ~/racon/
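After reinstalling, a quick check along these lines can confirm that the new binary is on the PATH and starts at all. This is only a sketch: the function name check_racon is hypothetical, and printing the version does not exercise the polishing code itself, so a clean result here does not guarantee that the original error is gone.

```shell
# check_racon
# Hypothetical helper: reports where racon is installed and prints its version.
check_racon() {
    if command -v racon >/dev/null 2>&1; then
        echo "racon found: $(command -v racon)"
        racon --version
    else
        echo "racon is not on the PATH" >&2
        return 1
    fi
}

check_racon || echo "installation check failed" >&2
```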