bwa-mem does not map to entire reference genome

Hello,

I am using bwa-mem in my local galaxy to map paired-end sequencing reads to a bacterial reference genome. I just want to be sure I didn’t make a mistake with the analysis before asking whether this was a problem with sequencing - as fastq on the files looked good.

I used DMs to reference the genome using the following steps:

Create DBKey and reference genome - used existing dbkey (found my species), using accession number from NCBI for the fasta sequence of my reference genome, left ‘sort by chromosome name: as is’.

Then I created indexes using SAM FASTA index builder, Picard index builder, TwoBitbuilder and BWA-MEM index building (in this order).

Then I visualized the bwa-mem output (bam) file using Integrated Genome Browser (IGB). The reference genome is around 6.2Mb. When I open the bam alignment file on IGB, the reads only map to about 3Kb (the ‘load data’ button does not load new sequence data past this point).

1 Like

Additional details

I used the reference genome generated by the BWA INDEX tool.
The bwa-mem parameters - I selected paired-end reads, set SAM/BAM specifications (all auto) and used simple illumina mode

I appreciate any suggestions

My guess is that the reference genome you indexed has one or both of these two problems:

  • Contains description line content on the “>” title line. If present, it needs to be removed before using Data Managers. Tool: NormalizeFasta

  • Fasta is not truly a match for the genome with an “existing dbkey”. After checking/fixing item 1 above, create a “new dbkey” with the first (create) DM and use that for the other indexes.

Existing dbkeys do have some native data already cashed. If your genome is not an exact match for the associated cashed genome version (specifically chromosome names + length), expected problems.

In particular, using existing bacterial genome dbkeys will present problems. These genomes were sourced from the UCSC Microbial genome browser many years ago (10+). All had/have a very helpful (joke) single chromosome name of “chr”. This will create a mismatch with the NCBI accession named chromosome in the genome fasta you indexed with Data Managers.

Quick fix checklist:

  1. Use NormalizeFasta to create a very basic fasta formatted genome fasta. Specifically, wrap to 80 bases and truncate the title line at the first whitespace.
  2. Create your own dbkey instead of using an existing one. Could be the accession name. Add in some version/source information to the description to help with keeping track of where/when this was obtained (source, version, date).
  3. Rerun all Data Managers using that new dbkey.
  4. Optionally, remove the original indexes. Go into the directory where you keep indexed genomes and …
    • manually alter the .loc files touched by the Data Managers (comment “#” the lines for the pre-cached dbkey you used)
    • and delete any associated indexed data that is problematic (where the data is located is noted in the .loc files).
    • In the Galaxy UI under “Admin” you can also review .loc files under “Server > Data tables”.
    • Note: Step 4 is not recommended unless you are comfortable working on the command line and troubleshooting. If you decide not to do this, just avoid that other dbkey and accept that you’ll have some extra data around that cannot be used. Should you decide to do this, shut down Galaxy before making changes. When you start Galaxy up again, the updated data tables will be refreshed in the interface.

There is no “undo” function for Data Managers (yet) but it is something we consider to be an important enhancement. Sometimes completely starting over is what works best for people (setting Galaxy up from scratch).

Genome indexes are very specific – a species match is not enough. A genome version/build match is.

Hope that helps!

1 Like

Related Q&A Determine chromosome names in reference genome already in Galaxy?

Hello,

Thanks for your help, so I normalized the fasta file and created a new DBKey. How do I ensure I am using the correct dbkey for the data managers.

For example, when I go to rerun the DMs, in the “source fasta sequence” field, it gives me two options for fasta file, I didn’t know if there was a difference, so I purged my history, re-started my galaxy and still have the two options, when I really should not have any fasta files available at this point?

And just to confirm, when creating the new dbkey, for the field “sort by chromosome name”, which option should I choose?

Sorry, I was able to just re-name the new one and now I see it as an option. Is there a way to clear previous dbkeys just for convenience?

The duplicates are showing up because you indexed data for an existing dbkey with a mismatched genome. Deleting a history will not change the indexed genomes. This is an example of the type of problems I described before when Data Managers are run incorrectly (and cannot currently be “undone” through the Admin interface).

You could examine your .loc files and edit those by hand as explained above. Then go to the paths in the .loc files for the problematic indexes and delete that data. But as I warned earlier, this can easily create mistakes that can make the server usable.

So … three choices:

  1. Just accept that there is a problematic dbkey with associated data and ignore it.

  2. Manually edit your data tables and files. This is risky and not advised, but it can be done.

  3. Start completely over with a fresh Galaxy install.

I see, okay I can accept there is a problematic dbkey and ignore it. I opened SAM FASTA index builder, and I see my new dbkey as an option, and ran it but I got an error

Fatal error: Exit code 1 ()

Sorry, all the data managers run fine, I accidentally changed the datatype

Thanks for your help!

1 Like

Great, glad all is working now!!