Polishing with medaka
Medaka is a tool that creates a consensus sequence from nanopore sequencing data. This task is performed by neural networks applied to a pileup of individual sequencing reads against a draft assembly. It outperforms graph-based methods operating on basecalled data and can be competitive with state-of-the-art signal-based methods while being much faster.
In earlier courses, we used nanopolish for polishing, but medaka outperforms it in both runtime and accuracy.
As input, medaka accepts a sorted and indexed BAM mapping file. It also requires a draft assembly in FASTA format (.fasta).
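To make "sorted and indexed BAM" concrete, here is a minimal sketch of how such a file could be produced by hand with minimap2 and samtools. It assumes both tools are installed; the file names and the thread count are hypothetical placeholders, and this is not necessarily what medaka does internally.

```shell
# Hypothetical file names; replace them with your own data.
DRAFT="draft.fasta"
READS="reads.fastq"
BAM="reads_to_draft.bam"
THREADS=4

make_sorted_bam() {
    # Map nanopore reads to the draft, then sort the alignments by coordinate.
    minimap2 -ax map-ont -t "$THREADS" "$DRAFT" "$READS" \
        | samtools sort -@ "$THREADS" -o "$BAM" -
    # Index the sorted BAM (writes "$BAM.bai").
    samtools index "$BAM"
}

# Only run when both tools and both input files are present.
if command -v minimap2 >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
    && [ -f "$DRAFT" ] && [ -f "$READS" ]; then
    make_sorted_bam
else
    echo "minimap2, samtools, or input files missing; nothing was run"
fi
```

In this course the mapping is normally handled for you, so the sketch is only meant to show what kind of file medaka expects.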
Medaka has 3 steps / subtools:
mini_align (basically runs a minimap2 mapping)
medaka consensus (generates a consensus, you can do that for subparts of the assembly to improve runtime)
medaka stitch (to stitch the subparts together, or generate a fasta from the results from medaka consensus)
However, for smaller assemblies, we can just use medaka_consensus, which performs all the steps above:
medaka 1.2.0
------------
Assembly polishing via neural networks. The input assembly should be
preprocessed with racon.
medaka_consensus [-h] -i <fastx>
-h show this help text.
-i fastx input basecalls (required).
-d fasta input assembly (required).
-o output folder (default: medaka).
-g don't fill gaps in consensus with draft sequence.
-m medaka model, (default: r941_min_high_g360).
Available: r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r10_min_high_g303, r10_min_high_g340, r941_min_fast_g303, r941_min_high_g303, r941_min_high_g330, r941_min_high_g340_rle, r941_min_high_g344, r941_min_high_g351, r941_min_high_g360, r941_prom_fast_g303, r941_prom_high_g303, r941_prom_high_g330, r941_prom_high_g344, r941_prom_high_g360, r941_prom_high_g4011, r941_prom_snp_g303, r941_prom_snp_g322, r941_prom_snp_g360, r941_prom_variant_g303, r941_prom_variant_g322, r941_prom_variant_g360.
Alternatively a .hdf file from 'medaka train'.
-f Force overwrite of outputs (default will reuse existing outputs).
-t number of threads with which to create features (default: 1).
-b batchsize, controls memory use (default: 100).
-i must be specified.
We need to define the following parameters:
-i <input fastq>
-d <racon reference assembly>
-o <output folder>, should be: ~/workdir/assembly/assembly_wgs/medaka/
-t <threads>
-m <the appropriate medaka model>
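The parameters above could be assembled into a single call roughly as follows. This is a sketch only: the read and draft paths are hypothetical, the thread count is an arbitrary choice, and the model is deliberately left as a placeholder for you to fill in. Only the output folder and the flags come from the text and help output above.

```shell
# Hypothetical input paths; only OUTDIR is taken from the text above.
READS="$HOME/workdir/data/reads.fastq"
DRAFT="$HOME/workdir/assembly/assembly_wgs/racon.fasta"
OUTDIR="$HOME/workdir/assembly/assembly_wgs/medaka/"
THREADS=4
MODEL="MODEL_NAME_HERE"   # placeholder: pick one from the "Available" list

# Dry run: print the command. Remove the leading "echo" to execute it.
echo medaka_consensus -i "$READS" -d "$DRAFT" -o "$OUTDIR" -t "$THREADS" -m "$MODEL"
```

Printing the command first makes it easy to check the paths and the model name before starting a long-running job.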
The models are named according to the following scheme:
{pore}_{device}_{caller variant}_{caller version}
Our pore is r941, the device is MinION (min), we use the high-accuracy model (high), and our guppy version was 4.15. Choose the model that is closest to that basecaller version.
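To illustrate the naming scheme, the following sketch splits one model name from the help text into its four fields. The variable names are chosen here for illustration, and the example model is not necessarily the one you need for this exercise.

```shell
MODEL="r103_prom_high_g360"   # example name taken from the help text

# Split on underscores; any extra suffix stays attached to the last field.
IFS=_ read -r PORE DEVICE VARIANT VERSION <<EOF
$MODEL
EOF

echo "pore=$PORE device=$DEVICE variant=$VARIANT version=$VERSION"
# prints: pore=r103 device=prom variant=high version=g360
```

Reading the available names this way makes it easier to compare their last field against your basecaller version.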
If you are stuck, get help on the next page.