Inspect the fast5 files

TODO: Mit alten Daten machen, raus oder fixen.

The raw signals in Nanopore sequencing are stored in HDF5 format. HDF stands for “Hierarchical Data Format”, and it is quite similar to json. Terms used by HDF include Groups, Attributes and Datasets. A Group can contain Groups or Datasets and may have Attributes. A Dataset contains an array of datapoints. The way HDF5 is stored allows it to access individual Groups within the dataset fast and efficient.

The HDF5 tools can be used to display contents of HDF5 files. We will use two of them to explore our data:

h5dump -- Enables a user to examine the contents of an HDF5 file and dump those contents to an ASCII file.
h5ls -- Lists specified features of HDF5 file contents.

In order to get the complete content of a fast5 in readable form, you can use:

h5dump <path to one of your fast5 files>

Inspect the output, you could redirect the quite long output to less or more:

h5dump <path to one of your fast5 files> | less

The file starts with a root group:

GROUP "/" {
[...]

followed by the first read:

GROUP "read_0001fc5e-e2b8-47aa-a20d-6b950611417b" {
[...]

at some point, the actual data is stored as a dataset:

DATASET "Signal" {
  DATATYPE  H5T_STD_I16LE
  DATASPACE  SIMPLE { ( 104805 ) / ( H5S_UNLIMITED ) }
  DATA {
  (0): 450, 414, 428, 435, 445, 439, 439, 416, 432, 437, 410, 403,
  (12): 429, 410, 415, 426, 424, 409, 415, 416, 422, 421, 422, 418,
  [...]

To get an overview on all reads, you could use the h5ls command:

h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5

This will give you a list of all reads:

read_0061d165-af04-4c39-ad5e-8c4ebe3caa80 Group
read_00e09394-e199-4738-bf1a-2d2f97dff4a8 Group
read_01204611-3592-419c-9785-83c0fafa4c4a Group
read_0130da91-b3ad-42bd-847e-0d2ce314ee48 Group
read_015af2b6-ef6c-4c86-b598-02325d42fc6d Group
read_01a509fd-a58c-4257-8136-600c22b4d053 Group
read_01ed0343-0286-42f1-8ea3-0904d66521b6 Group
read_025147d4-e101-426c-8b80-38d752d41dc8 Group
[...]

Which you can simply count to get the number of reads in your fast5 file:

h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5 | wc -l

In order to inspect what is stored for an individual read, you can specify that read, as if it were a directory using h5ls:

h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_a3d14887-0d45-4ef5-8a20-42af8257053d
or (group2):
h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_0061d165-af04-4c39-ad5e-8c4ebe3caa80

Which gives you the groups (“subdirectories”) for that Read:

PreviousReadInfo         Group
Raw                      Group
channel_id               Group
context_tags             Group
tracking_id              Group

Let’s assume, we are interested in the raw data of a specific read:

h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_a3d14887-0d45-4ef5-8a20-42af8257053d/Raw
or (group2):
h5ls data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_0061d165-af04-4c39-ad5e-8c4ebe3caa80/Raw

The output is:

Signal                   Dataset {104805/Inf}

So we have reached the actual raw data (indicated by “Dataset”). To view a dataset, h5ls has a ‘-d’ option:

h5ls -d data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_a3d14887-0d45-4ef5-8a20-42af8257053d/Raw/Signal
or (group2):
h5ls -d data/fast5_tiny/GXB01322_20181217_FAK35493_GA10000_sequencing_run_Run00014_MIN106_RBK004_46674_0.fast5/read_0061d165-af04-4c39-ad5e-8c4ebe3caa80/Raw/Signal

Which will give you the raw signal of that specific read:

Signal                   Dataset {104805/Inf}
  Data:
  (0) 450, 414, 428, 435, 445, 439, 439, 416, 432, 437, 410, 403, 429, 410, 415, 426, 424, 409, 415, 416, 422, 421, 422, 418, 425, 424, 414, 419,
  (28) 434, 429, 424, 412, 423, 412, 411, 411, 409, 423, 421, 413, 408, 429, 422, 432, 432, 408, 438, 408, 428, 416, 418, 429, 427, 423, 434, 432,
  (56) 426, 418, 436, 440, 418, 415, 423, 428, 416, 419, 425, 430, 425, 423, 408, 428, 419, 424, 426, 426, 419, 428, 436, 421, 418, 412, 426, 430,
  (84) 438, 439, 426, 415, 444, 418, 419, 428, 433, 432, 415, 419, 426, 439, 411, 410, 414, 417, 426, 433, 430, 430, 412, 418, 418, 410, 423, 424,
  (112) 426, 412, 422, 410, 415, 416, 427, 407, 429, 439, 420, 430, 426, 420, 426, 424, 419, 424, 420, 415, 429, 418, 418, 424, 425, 425, 419,
  (139) 424, 424, 420, 419, 431, 440, 429, 418, 421, 421, 427, 421, 423, 410, 423, 432, 436, 426, 417, 425, 436, 425, 423, 418, 426, 425, 424,
  (166) 419, 422, 411, 427, 423, 424, 424, 423, 420, 430, 424, 426, 434, 405, 420, 419, 427, 423, 423, 432, 421, 430, 418, 433, 430, 424, 427,
  (193) 425, 421, 421, 437, 433, 422, 430, 412, 426, 416, 427, 426, 417, 420, 427, 417, 426, 427, 422, 435, 429, 425, 428, 428, 395, 432, 424,

Now that you have an idea of how the raw data out of the machine looks like, we can start the basecalling.