Samtools IdxStats: interpreting and troubleshooting unmapped reads

SANJAY · August 10, 2019, 2:13am

Hi Galaxy community members,
I am using BWA MEM (default setting) to map paired-end fastq files trimmed using trimmomatic. The reference sequence was downloaded from ncbi as a fasta file. The Idxstats result shows zero values for rows 2 (Reference sequence length) and 3 (Number of mapped reads). Can we consider that BWA MEM worked well in this situation?
Image attached idxstats%20output
Thank you
Sanjay

jennaj · August 10, 2019, 7:02pm

Hi @SANJAY

The last column represents unmapped reads. See the IdxStats tool form help to better understand the output (scroll down to the help section).

In short, technically this is fine.

Yet scientifically, this kind of count output could indicate a problem. A third of your reads didn’t map – that is very high for most analysis projects.

It is always possible that this is the best result you can get with your given inputs, but below are your options to confirm that. Or, you didn’t expect high mapping rates for some reason to start with.

Troubleshooting help if you expected a higher mapping rate:

There is some QA/QC problem with your reads:
- Low base-calling quality
- Contamination
- High redundancy
- BWA-MEM maps DNA. Use HISAT2 for RNA reads. When used with RNA, it is usually for data exploration purposes and a high number of unmapped would be expected.
The Custom genome assembly:
- Was not completely Uploaded to Galaxy (is truncated). Due to the way a single chromosome fasta dataset is formatted, this wouldn’t be obvious. You could compare source + Galaxy genome sizes (base count) – tool Compute sequence length.
- Is not of high quality or incomplete in some way (not “finished”)
- Some portion of your reads represent novel content not included in the assembly (this could be interpreted/investigated in many ways, including by assembling your reads)
- Does not represent the same genome as your reads. Mycobacterium tuberculosis has a few different strains, maybe a mismatched genome was mapped against? (if is even available)

Run FastQC on your data and review for potential issues. You also might need to adjust the BWA-MEM tool form settings to get better results. On the form most of the line command options are available (and explained) under the option Select analysis mode > Full list of parameters. You may need to do some testing of parameters. There is a link to the tool publication on the tool form help and much online discussion (mostly for command-line usage but still helpful for setting up the mapping in Galaxy).

I don’t think the Galaxy tutorials will help immediately. Those that include BWA-MEM use default settings or settings specific to the reads used in the tutorial – and those do not address this specific kind of troubleshooting, or rather, there is not enough information yet about why the mapping rates are low. Do more checks to find out what the problem space is, then you can review to see if that is covered by a tutorial.

Note: Before mapping again make sure that your custom genome has had the description line content removed (tool: NormalizeFasta). NCBI includes descriptions on the > title line of fasta data. That won’t cause problems with the mapping tools, but definitely will with many downstream tools (manifests as odd tool errors or successful but scientifically incorrect results). When this happens, it usually means that the analysis has to be started completely over from the mapping step. No one enjoys that. FAQs:

Thanks!

SANJAY · August 10, 2019, 8:11pm

Hi Jennaj,
Thank you for the response. I tried to normalize the fasta but did not get rid of (<) sign using the following settings:
The line length to be used for the output fasta file=100
Truncate sequence names at first whitespace=Yes

I am actually having a problem in variant detection and want to restart from mapping. I am performing following analyses: BWA-MEM -->Samtoolsmpileup(version 2.1.3, to obtain vcf) -->bcftools call -->snpeff. I am expecting around 800 variants but am getting 3 times of it with snpeff output where the number of errors=Number of variants processed
(i.e. after filter and non-variants)= Number of effects. Therefore, I assume these are not the actual variants. normalize%20fasta bwa%20mem mpileup%20err snpeff%20er ![snpeff|680x437]

Thank you
Sanjay

Topic		Replies	Views
bwa-mem does not map to entire reference genome reference-index , galaxy-local , data-manager , genome	10	3247	July 15, 2019
Map with BWA-MEM problem - An error occurred with this dataset: bam format mapping , bwa-mem	3	132	September 17, 2024
BWA-MEM results in "An error occurred setting metadata for this dataset" usegalaxy.org support metadata , mapping	10	2794	January 21, 2019
Mapping reads to references by PEAR, bowtie2, samtool idx and sort tools samtools , variant-analysis	4	455	June 2, 2023
NGS tutorial - Mapping trimmed fastq Files usegalaxy.org support	0	452	January 6, 2021

Samtools IdxStats: interpreting and troubleshooting unmapped reads

Related topics