BamLeftAlign error unable to find fasta index -- use "fasta" version of genome or natively indexed genome

Hello,

I am performing Variant calling using the following protocol https://galaxyproject.org/tutorials/var_hap/

I want to compare 2 experimental samples (paired-end reads) against the reference Pseudomonas aeruginosa PAO1 genome. I mapped both these experimental samples against PAO1 (I input the fasta sequence from NCBI and used bwa to index, then mapped using bwa-mem. I merged the 2 sets of bam files, removed duplicates which produced one bam output file. I input this bam file into the BamLeftAlign tool, used the default pseudomonas aeruginosa PAO1 reference genome from the drop-down arrow, with max 5 iterations. I received the error: unable to find FASTA index.

Could not display BAM file, error was:
file has no sequences defined (mode=‘rb’) - is it SAM/BAM format? Consider opening with check_sq=False

In the protocol, they use the same bam output file after removing duplicates, so I am unsure what is wrong here?

1 Like

You are running this on a local Galaxy?

And this step failed? https://galaxyproject.org/tutorials/var_hap/#left-aligning-indels

If so, did you index the genome for Samtools, Picard, and 2bit with Data Managers? Just indexing for BWA is not enough.

See this prior Q&A for the way to fully index a genome so that it works with all tools (BAM data in particular relies on the extra indexes listed out in the post): Indexing reference genomes with Data Managers: Resources, tutorials, troubleshooting

Thank you for your reply,

Yes this is a local galaxy. On the bwa-mem step, I used the option to put in a fasta sequence and then used bwa-is to index, that is not enough? The mapping step did not fail.

For the left align indels step, I just used the PAO1 reference genome in the drop down menu, not the genome that I indexed. I could index myself as you suggest, is there a way for me to add it to the drop down menu in the bamleftalign tool

Ok, that explains the problem.

The genome is in a list, with some basic data, but hasn’t been fully indexed on your server.

When working with BAM data, the Samtools index is important. Some tools also use the Picard and 2bit.

If you instead want to continue using the genome as a “custom genome” fasta from the history, you’ll need to 1) make sure the formatting is cleaned up 2) promote the genome to a “custom build”, then assign that database to your data. Some tools require a database is assigned – either natively indexed by a Data Manager, or indexed just in your account as a Custom Build.

Warning: using a custom genome or build takes up much more resources, every time you run a tool. Using data managers to index the genome will make jobs run much faster.

When there are custom genome problems, problems tend to not show up during mapping, but with later steps. The fasta format is very important, or expect more problems. This may mean that you will need to remap.

The first FAQ covers the details and the second explains how to avoid even more problems when incorporating reference annotation (if you decide to do that). In short, whatever chromosome identifier(s) you decide to use in your genome, it must be an exact match for the chromosome identifier(s) in other inputs.

https://galaxyproject.org/support/

Hi,

Sorry I am a little confused. So my problem is with my reference genome index? Instead of indexing the reference genome during the mapping step, I should index it separately through Data Managers or do a Custom Build? You mention going through Data Manager is better so I am assuming I need to contact an admin to install: Data Manager for building BWA (0.6+) indexes?

1 Like

This is a problem with either the database assignment issue, an missing index, or more likely both. The samtools index is probably missing.

This is your own local Galaxy and you are the admin, correct?

As am admin, you will need to install special tools to index genomes (Data managers). After installed, run them. The genome to index can be your fasta file used as a Custom genome. Ensure the formatting is Ok first. This post I already shared has DM instructions.

If you do not want to create local indexes, then you should promote the Custom genome to a custom build, then assign that build as a database to your BAM file. Custom genome/build help is above in my last reply.

Hello,

Okay I am sorry that is my issue, I haven’t set up my own local Galaxy I have just logged in to usegalaxy.org. I will set that up now, and then install the necessary tools

Thank you

You can still keep on working at Galaxy Main https://usegalaxy.org if you want. Many choose this option when working with will smaller genomes.

But you need to set up the custom genome/build and assign that database to your data when using this tool and others like it.

Once the custom build is created in your account, you’ll find it in your list of “databases” that can be assigned.

Hello,

I am still getting an error from BamLeftAlign:
file has no sequences defined (mode=‘rb’)

I used data managers to create a custom genome reference (downloaded fasta, normalize, create new DBkey, used multiple DMs to index (SAM, picard, 2bit, bwa)

I then mapped with bwa-mem - I used both auto-assign and manually entry (both times I left the following fields empty: flow order, KS, PG), but the rest will filled in (including PU)

I then used mark duplicates tool
parameters: remove_duplicates:NO; assume_sorted:YES; scoring_strategy:SUM OF BASE QUALITIES; max offset:100; validation stringency:LENIENT
I left the following fields blank: ‘regular expression to parse read names’, ;barcode tag’

I ran bamleftalign on the bam output from the bwa-mem mapping step, and after the markduplicates step. I have tried using the normalized fasta file as reference, then I tried using the indexed references (from all the DMs, but they do say “unavailable” in the name) as reference

All of my attempts have given me the error “file has no sequences defined (mode=‘rb’)”

I am assuming it is likely a problem with the bam outputs - something I did in the bwa-mem step? Is there a problem with the indexed reference genomes that make them unavailable?

1 Like

This means that there is a problem with the BAM input dataset.

Does it contain data lines (not just the header)? Click on the eye icon and scroll down past the header to check.

If no hits, did the HISAT2 mapping even work? Check the statistics. Or run a BAM/SAM summary tool from the Samtools/Picard tool groups.

Indexes genomes should show up on the tool form. Not sure why they wouldn’t unless the 1) indexing failed 2) dbkeys were mixed up again 3) the most current version of Galaxy (19.05) and of tools (DMs, mappers, other tools) are not being used.

Let’s troubleshoot from there.

Hello,

There is data lines, this is a screen shot of the bwa-mem bam output

I can run a summary now and show you what that looks like

As for the indexed genomes, this is an example of what it looks like if I try to perform bamleftalign on the bwa-mem bam output using the samtools indexed genome - I only see this genome as an option if I click “browse genomes”, the automatic genome that comes up is the normalized fasta file

Thank you so much!

1 Like

The history is in a purged (permanently deleted) state.

Any datasets included in it are also in a purged state, even though that seems to NOT be reflected in the display. It is not clear why.

Purged data is non-useable.

Might work, might not: try the history menu option “Copy Datasets” to move the data that doesn’t appear to be purged yet into a new history. You might want to start naming your histories, too. Having a bunch of “Unnamed history” items in your Saved Histories list will get confusing.

Hello,

Okay, I started a new saved history, and tried re-doing all the indexed reference genomes/bwa-mem mapping again

My indexed reference genomes still appear “unavailable” as shown below, the normalized fasta file is the first option to appear in the field, if I use any of the indexed genomes or normalized fasta file, I still get the same error.

During the bwa-mem mapping step, I set SAM/BAM specifications, do auto-assign three times (YES) and filled out the platform unit manually, therefore I left the following fields blank: Flow Order (FO), Array of nt bases that correspond to key sequence… (KS), Programs used for processing read group (PR), Predicted median insert size (PI)

I am assuming there is something wrong with how I have filled out these fields…

The error is still the same - no sequences defined (mode= ‘rb’)

1 Like

Two choices:

  1. Change the “source for the reference list” to be “Locally cashed”. If you indexed your genome, it will show up in the pull-down menu.

  2. OR – Input the fasta version of the genome (dataset 6) for “Using reference file”, not the dataset that represents the DM job that created the local index. This doesn’t take advantage of genome already being indexed (not sure why you would want to do this).

Some data organization advice: Run data managers in a distinct history that is clearly named, contains the original fasta (when one is used from the history), and keep all of that separate from your analysis. If you need the fasta for some reason, you can always copy it into an analysis history. Copied datasets owned by you, and cloned into other histories, do not consume additional disk space.

Small update: Reading back it sounds like you indexed the same genome more than once. If the same “dbkey” was used, this creates duplicated lines in the data tables, which can lead to all sorts of problems, including native indexes not showing up in pull-down menus.

That is difficult to clean up if actually done. There is no “undo” for Data Managers. I can link you to another post where this was discussed in detail that explains your options (need to find it first and post it back in another reply).

Ok, this post explains what happens when a mismatched dbkey is used. It also explains how to clean up data tables (although it is NOT recommended, tedious, and error prone). But, perhaps may help. bwa-mem does not map to entire reference genome

You can view all of your data tables under the “Admin” menu to check if you have duplicates or not.

Other posts with the tag “data-manager” have descriptions/troubleshooting for other situations and may be helpful to review.

Hello,

Thank you for getting back to me,

I tried, but if I change to locally cashed, there are ‘no options available’ for reference genomes.

So I did create multiple dbkeys, but I was careful to give a distinct name to the one I did properly (download fasta, normalize, create new dbkey…).

I used that new dbkey to index multiple times, I thought I was supposed to do this from what I read in another thread it was recommended to index with SAM FASTA indexer, picard indexer, 2bit indexer and then any tool-specific indexes you need (I performed BWA-INDEX). I used the bwa-index for the bwa-mem mapping step, I just read that the other indexes would be preferred for potential downstream tools. Was I not supposed to do this?

I do have a duplicate from the previous dbkey I tried to create, but the one I did properly is called myPAO1 and there is only one of that file

What exactly does the error mean?
no sequences defined mode=‘rb’

Hello,

Looking further into it, I do have duplicates of the indexed genomes as shown below, maybe this is my problem as you mentioned?.

Maybe best at this point to shut this local galaxy down, and start from scratch.

1 Like

Yes, starting over is often the best choice, especially if you haven’t heavily invested in the instance so far yet.

Once you’ve done this successfully for one genome, you’ll understand how it works better for next time. Inspect the data tables after running the DMs (restart Galaxy first) to confirm you have the desired results (for example: no duplicates), and all should go smoother…

Later on, you can explore a utility that automates DM + genome installation/indexing: Ephemeris – small warning: this utility does require some admin/coding skills, at least for now, may have a GUI layer put on top in some future version (an idea still under discussion regarding implementation details).

Or, just continue to use the DMs directly, keeping it simple, especially if you only have one or a few genomes to index.

Again, I strongly advise keeping your genome data/DM runs organized. Do this any way that makes sense for you. I personally name my histories with the dbkey, add a history tag “datamanager” to make these easier to find in case I want to add a new tool that requires a new index for the same genome, annotate the history with notes (dates particular DMs were run), keep the original and normalized fasta in the history (often annotating it with the date/source details), and then run all DMs, for that genome/build, in the same history, even new ones run later on. I do the same for non-genome-specific data managers, as some tools have DMs that create indexes not focused on individual genomes.

Sorry this happened. Fixing DM mistakes, including data tables and linked data, can be challenging, even for me, and I create AND fix data all the time. Everyone makes mistakes. Some of that doesn’t go so well either, requiring even more work to “fix the broken fix”… you get the idea… :sweat:

Let us know how this works out next time!

Thank you so much for all your help

I uninstalled everything and started from scratch.
I will definitely keep things more organized moving forward.
I just wanted to ensure before I start running the DMs on my new DBKey that I do not need to set the Database/Build - ie. can leave as unspecified. Just want everything to be as clean as possible this time.

I changed this last time to the species name of the reference genome, but I am wondering if that led to any problems

Thank you!!!

Alexa

1 Like