Hi Galaxy community members,
I am using BWA MEM (default setting) to map paired-end fastq files trimmed using trimmomatic. The reference sequence was downloaded from ncbi as a fasta file. The Idxstats result shows zero values for rows 2 (Reference sequence length) and 3 (Number of mapped reads). Can we consider that BWA MEM worked well in this situation?
Image attached
Thank you
Sanjay
Hi @SANJAY
The last column represents unmapped reads. See the IdxStats
tool form help to better understand the output (scroll down to the help section).
In short, technically this is fine.
Yet scientifically, this kind of count output could indicate a problem. A third of your reads didn’t map – that is very high for most analysis projects.
It is always possible that this is the best result you can get with your given inputs, but below are your options to confirm that. Or, you didn’t expect high mapping rates for some reason to start with.
Troubleshooting help if you expected a higher mapping rate:
- There is some QA/QC problem with your reads:
- Low base-calling quality
- Contamination
- High redundancy
- BWA-MEM maps DNA. Use HISAT2 for RNA reads. When used with RNA, it is usually for data exploration purposes and a high number of unmapped would be expected.
- The Custom genome assembly:
- Was not completely Uploaded to Galaxy (is truncated). Due to the way a single chromosome fasta dataset is formatted, this wouldn’t be obvious. You could compare source + Galaxy genome sizes (base count) – tool
Compute sequence length
. - Is not of high quality or incomplete in some way (not “finished”)
- Some portion of your reads represent novel content not included in the assembly (this could be interpreted/investigated in many ways, including by assembling your reads)
- Does not represent the same genome as your reads. Mycobacterium tuberculosis has a few different strains, maybe a mismatched genome was mapped against? (if is even available)
- Was not completely Uploaded to Galaxy (is truncated). Due to the way a single chromosome fasta dataset is formatted, this wouldn’t be obvious. You could compare source + Galaxy genome sizes (base count) – tool
Run FastQC on your
data and review for potential issues. You also might need to adjust the BWA-MEM
tool form settings to get better results. On the form most of the line command options are available (and explained) under the option Select analysis mode > Full list of parameters
. You may need to do some testing of parameters. There is a link to the tool publication on the tool form help and much online discussion (mostly for command-line usage but still helpful for setting up the mapping in Galaxy).
I don’t think the Galaxy tutorials will help immediately. Those that include BWA-MEM use default settings or settings specific to the reads used in the tutorial – and those do not address this specific kind of troubleshooting, or rather, there is not enough information yet about why the mapping rates are low. Do more checks to find out what the problem space is, then you can review to see if that is covered by a tutorial.
Note: Before mapping again make sure that your custom genome has had the description line content removed (tool: NormalizeFasta
). NCBI includes descriptions on the >
title line of fasta data. That won’t cause problems with the mapping tools, but definitely will with many downstream tools (manifests as odd tool errors or successful but scientifically incorrect results). When this happens, it usually means that the analysis has to be started completely over from the mapping step. No one enjoys that. FAQs:
- Preparing and using a Custom Reference Genome or Build
- Mismatched Chromosome identifiers (and how to avoid them)
Thanks!
Hi Jennaj,
Thank you for the response. I tried to normalize the fasta but did not get rid of (<) sign using the following settings:
The line length to be used for the output fasta file=100
Truncate sequence names at first whitespace=Yes
I am actually having a problem in variant detection and want to restart from mapping. I am performing following analyses: BWA-MEM -->Samtoolsmpileup(version 2.1.3, to obtain vcf) -->bcftools call -->snpeff. I am expecting around 800 variants but am getting 3 times of it with snpeff output where the number of errors=Number of variants processed
(i.e. after filter and non-variants)= Number of effects. Therefore, I assume these are not the actual variants.
![snpeff|680x437]
Thank you
Sanjay