compare Galaxy europe aligners (bowtie2, BWA and minimap2) and CLC genomics workbench

Dear Sir/Madam,

Recently I analyzed my fastq files using your galaxy aligners, bowtie2, BWA and minimap2. But mapped reads to references are very few, only 1, see below:
reference: QOD39769.1_Coxsackievirus_A6, length 6606, mapped reads 1

I then used CLC genomics workbench to analyze, and found many reads mapped to the references, see below:

|Reference name|Consensus length|Total read count|

|ACU27110.1_rhnovrus_sp mapping|141|48231|
|BAQ55332.1_Rhnovrus_C mapping|141|1912|
|QRG33173.1_partial_Rhinovirus_C mapping|371|1604|
|QED08818.1_Rhnovrus_C mapping|135|1459|
|BAQ55399.1_Rhnovrus_C mapping|140|1036|
|BAQ55397.1_Rhnovrus_C mapping|141|975|
|ADM08060.1_Rhnovrus_C mapping|140|900|
|AEM44641.1_Rhnovrus_C mapping|148|135|
|QRG33243.1_partial_Rhinovirus_C mapping|134|101|
|QWN56617.1_rhnovrus_C9 mapping|144|25|
|AEO37442.1_Echovrus_E5 mapping|178|8|
|UFT26851.1_rhnovrus_C9 mapping|128|6|
|UFT26753.1_rhnovrus_C9 mapping|128|6|
|QVG74571.1_rhnovrus_C31 mapping|122|6|
|UHU35009.1_rhnovrus_C31 mapping|122|5|
|AKM70887.1_Rhnovrus_C mapping|123|3|
|YP_001552411.1_genome_Rhinovirus_C mapping|120|3|
|YP_001552435.1_VP1_Rhinovirus_C mapping|110|1|
|QOD39769.1_Coxsackievirus_A6 mapping|149|1|
|AYW17095.1_enterovirus_D68 mapping|150|1|

I wonder why it is like this? Thanks.

Best regartds,
Chen

Hi @huiping

Maybe different parameters were used between the runs? This assumed the query, targets, and tool choices are otherwise similar.

You are running these tools through different platforms but that shouldn’t make a difference. You should be able to reproduce the results anywhere given the same tool, inputs, parameters. These platforms are mostly managing the technical details of “jobs” for you – not changing what the tools themselves produce. You are still in control of the scientific parts.

Does that help?

Thanks. I wonder how to change the parameters of your tools (BWA, bowtie2, minimap2) to get more mapped reads like CLC genomics workbench?

Best regards,

Chen

Hi @huiping

General help:

  • This tutorial covers how to map in general in Galaxy. Mapping

  • Your query reads type will matter for some tool choices.

    • If you are not sure what kind of data your reads represent, this tutorial can help. NGS data logistics
  • Most tools, including those you mention, have all (most?) of the command-line options implemented directly on the tool form. Those tend to be described inline where the option is set, with more information down in the help section. This includes the command flags. Maybe compare the command strings between the runs you are trying to replicate?

    • How to find the command-string in Galaxy is described here (in short, click on the “i” job information icon in any dataset to learn what was submitted, exactly: inputs, parameters, logs). The FAQ is labeled as being for errors, but the same advice also applies for odd outputs, or really for any reason! Troubleshooting errors

Please give those a review a try first. Then, if you need more help translating a command string, we’ll need more details about the job you are trying to replicate. The details will matter, so please be specific. My guess is that either the tool choice, or the target database you are mapping against are the root differences (so, at a higher level than tool alignment parameter choices) but we can confirm that. :slight_smile:

Thank you, Jennifer.

My job is for genotyping of enteroviruses from patients. As you may know, there are lots of subtypes of enteroviruses, so we use multiple fasta references to align with fastq files. When we used your galaxy tools, like bowtie2, BWA, etc., only very few reads aligned to the multi-fasta references, but with CLC genomics workbench, lots of reads found to align to the multi-fasta references.

We tried to tweak the parameters of the bowtie2 tool, but no improvement could be seen.

Best reagrds,
Chen

Hi @huiping

Thanks for sharing those details, very helpful!

Bowtie2 and BWA both map DNA Illumina reads, so please confirm that is also your read type.

Both of those tools were designed to report only the best same-species matches versus a single genome. Adjusting alignment parameters help more with finding weaker hits rather than the best hit per target sequence. The sequences in the target are expected to all be from the same genome, e.g. chromosomes or contigs, not distinct assemblies.

For cross-species mappings, a tool like BLAST+ is likely a better choice. BLASTN+ will map nucleotide reads to a nucleotide target which sounds like what you are doing. Other tools in the suite can handle protein/nucleotide targets and queries in different combinations – the tool forms explain what each does, or you could read about them in detail at NCBI.

I’m guessing that the CLC workflow you are running incorporates a tool like BLAST+, and not the other two tools for the same reasons outlined above. From what I have read in their online tutorials, details about the published workflows or bundled tools are probably accessible in the web application somewhere. If you cannot find those details, maybe ask them for help?

Your earlier summary of the hits doesn’t include alignment quality statistics but that is is likely available. Meaning, some hits will be a better quality match than others. If the CLC results were filtered for exact/best matches per target sequence, many of those hits will probably fall out. You’ll need to do that same sort of filtering with raw BLAST results as well. Bowtie2/BWA apply a “best match” type of filtering at runtime (on purpose!).



Replicating a published analysis pipeline is usually a good place to start when deciding which tools to use. My suggestions are very general, and mostly about explaining why your results may differ when using these specific tools.

Galaxy has a suite of tutorials here Galaxy Training!. Many involve replicating a prior published result, and include a workflow that can be imported and customized. If these don’t cover your exact use case, you can run the custom analysis, then extract your own workflow for reuse. Searching the tool panel with the keywords (example: “genotype”) is another resource to get oriented. Full references, including tool author publication links, are at the bottom of each tool’s form.

Hope this helps!

1 Like

Hi Jennifer,
Our read type is illumina DNA read. We align reads to multiple references from same species. I have adjusted the alignment parameters, like sensitivity, seed sizes, but no any improvements can be seen. I wonder if you could tell me how to adjust the alignment parameters? Thanks.
Best regards,
Chen

Right, this was expected and is what I meant about “weaker” hits in the prior reply. Adjusting the parameters to be more sensitive might help some reads find a hit that didn’t originally. But if those same reads already had a better hit originally, that best hit is still what will be reported.

Technically, you could also adjust how many alignments per query are reported if you haven’t done that yet. Those sub-hits usually are not wanted for genotyping e.g. identifying the best target per query but maybe you are doing something else too?

And, whether or not you care about any new queries finding top weaker alignments is another thing to consider… maybe the sequencing quality of the reads is not great, or the target genomes are not in a polished assembly state, so you need to be more tolerant about those weaker, but still “best”, matches.

Examine the alignment statistics to see what is going on: are more reads getting a match (at all?). If you are not getting more reads to capture a hit against the target database, you may have already found the best possible matches per query read, and are probably just capturing sub-hits (weaker matches additive to the already stronger matches).

Have you tried using BLAST+ yet? This tool has options for a question like “report all hits that meet my minimum mapping criteria”. So, a different question from Bowtie2, or BWA, where the question is instead “report the best equivalent hits that meet my minimum mapping criteria”.

Did you find out what tool generated your original alignments? If you have the BAM output, you could import into Galaxy, run some statistics (samtools), and examine the alignments from the context of how many reads found a hit, and whether that is a primary or a sub-hit. So – a direct comparison with the same statistics. The prior statistics were from the perspective of the target e.g. how many targets are present in the result, and how many query sequences for each. It doesn’t tell you anything about how many query sequences had a hit, and how those hits sort out (primary or not, etc).

Dear Jennifer,

I used the samtool indx to analyze the data, and found no reads were mapped to any references, see below. I wonder how to tweak the alignment parameters to get mapped reads? Thanks.
Best,
Chen

Hi @huiping

So, originally a few reads mapped and now after adjusting parameters, no reads map at all?

Maybe back up and try a direct replication of your original mapping. You haven’t shared what tool that was yet.

And, Samtools has a few tools to generate statistics. The first is what you are sharing, and it is counting with respect to the targets (aka “BAM index”). The second counts with respect to the query (the reads).

  • Samtools idxstats reports stats of the BAM index file
  • Samtools flagstat tabulate descriptive stats for BAM dataset

Trying to reproduce a prior mapping involves: inputs (query + target), tool, and parameters. That is what I meant in the original reply.

Bowtie2 and BWA are somewhat similar. Minimap is different, and BLAST is different. And, the original mapper may be similar or not to any of those, and there are other mappers we haven’t discussed here yet, but we don’t know what that original tool was yet.

If you are having trouble finding out what the original mapping tool was … you could check the BAM header to see if the command was captured (would be by default, any tool) or ask the other service for clarification. Those details seem important anyway – doing analysis in a black box makes it harder to trust the results. And if for serious work (medical, publication, etc) you wouldn’t be able to describe the methods in enough detail to allow others to reproduce the analysis. You might not even be able to reproduce your own analysis e.g. if that black box changed over time, how would you know?

Please try to find those details, and we can try to help more here.

I should add … you could also decide to not focus on raw mapping results, and instead follow a genotyping protocol. That would involve data reduction steps that can get rid of a lot of initial noise (such as extra alignments that don’t contribute), and some top level conclusions. Those conclusions could be compared across methods to make decisions about the suitability of each protocol versus your goals. The tutorials I linked above cover variant calling for genotyping purposes, and more tools are available directly in the application. Or, maybe find a published method that you like and start with that?

Hi, Jennifer,

I used BWA for mapping, and nothing can be mapped. When I used bowtie2, just 1 read was mapped. But with minimap2, about 7 reads were mapped.

Best regards,
Chen

Hi @huiping

You could still share your history, and we’ll help with a more detailed review.

That FAQ above has a link to this FAQ with exact instructions :slight_smile: Sharing your History