How to interpret and generate statistics for any dataset, including BAM or SAM or Tabular data

eyoungman · March 27, 2019, 5:51am

I’d like to compare the result of aligning a small RNA-seq dataset using HISAT2 vs Bowtie. (I realize Bowtie is rather old, but I’d like to compare to old data analyzed with Bowtie.) I aligned the same FASTQ file using the 2 programs, and then ran samtools stats. The “raw total sequences” line in the stats output for the HISAT2 bam file matches up with the number of sequencing reads in the original FASTQ file (based on FASTQC), but the total sequences line from the Bowtie bam file is inflated by more than a million sequences. Can anybody see what I might be missing here? Am I misinterpreting the meaning of “raw total sequences?” I expected to see different numbers of aligned reads, but not total input reads.

jennaj · March 29, 2019, 1:35am

Welcome, @eyoungman

I took a look at one of your datasets mapped with both tools versus different tools that count up “reads” (and sometimes “alignment lines” instead!).

Fastq Groomer

Total seqs in the input fastq dataset reported: 2182610

FastQC

Total sequences in the input fastq dataset reported: 2182610
Total sequences in HISAT2 result BAM (sorted): 2509206
Total sequences in Bowtie result SAM (unsorted): 4345310
Total sequences in Bowtie result SAM (unsorted) converted with Sam-to-BAM to a BAM (sorted): 4345310

Samtools stats

Total “raw total sequences” in HISAT2 result BAM (sorted): 2182610
Total “raw total sequences” in Bowtie result SAM (unsorted): 4345310
Total “raw total sequences” in Bowtie result SAM (unsorted) converted with Sam-to-BAM to a BAM (sorted): 4345310

HISAT2

Total input reads reported the built-in mapping stats: 2182610

Bowtie

No built-in stats for this older tool

Line/Word/Character count

Some plain data manipulation to prepare first: Convert BAM-to-SAM or Sam-to-Fastq or Fastq-to-tabular, remove any headers with Select, and count the alignment lines. Then Cut out the reads names (first column) and finally get the Unique occurrences of each record. This is certainly a more complicated way to do the counts but is the “truth” of the content and can work on any data, with the right prep manipulations, if you are ever unsure about what is going on with a particular tool. Manipulating data this way doesn’t rely on special tools to interpret metadata/flags in non-obvious ways – plus you can compare raw text data counts to those from other tools to get a better idea about what those may be actually counting up.

Total “unique reads” in original Fastq (dataset 322 hidden, contained in collection dataset 323 as first dataset): 2182610
Total “alignment lines” in HISAT2 result BAM: 2509206
Total “unique reads” in HISAT2 result BAM: 2182610
Total “alignment lines” in Bowtie result SAM: 4345310
Total “unique reads” in Bowtie result SAM: 2182610

In summary, each reported alignment line or fastq entry (Fastq, BAM, SAM) is being counted up by some tools (including, in some cases, duplicated reported reads) and others are counting up true unique reads/sequences.

This is unfortunate but expected, and part of the reason the updated mapping tools (including Bowtie2) produce native tool mapping stats plus output sorted/better annotated BAMs by default.

Hope that helps, both to help make sense your data and to definitively confirm that all original reads were mapped by both tools. There are probably a hundred different ways to do this with Text/data manipulation tools (and line-command, outside of Galaxy with Unix utilities/scripts), but the tools I used should be a useful start, do not involve complicated regular expressions/programming, and are all included in Galaxy with simple usage.

Thanks!

eyoungman · June 12, 2019, 2:00pm

A belated but sincere thanks for this! Incredibly useful - I hadn’t thought of using Galaxy for manipulating files in this way.