FreeBayes errors out in v. 1.3.6 vs 1.3.1

Hello,
I have recently installed a local version of Galaxy and installed FreeBayes 1.3.6 from the tool shed. After indexing the genomes hg19 and hg38 as explained in many posts, I ran HISAT2 to generate the BAM files for input into FreeBayes (version 1.3.6). This version of FreeBayes fails to recognize the BAM files generated by HISAT2 and gives the following error: “Sequences are not currently available for the specified build”. The Galaxy.eu server version of FreeBayes is able to identify and process the BAM files generated by HISAT2 without any issues.
When I downgrade the FreeBayes version to 1.3.1 on the local instance of Galaxy, FreeBayes is able to recognize the BAM files and proceeds to process them. I am wondering if this anomalous behaviour is related to the latest version and how I can fix it (other than downgrading the FreeBayes version).
The FreeBayes version that I am using on the Galaxy.eu server is 1.1.0.46

Thank you

Update

This was bothering me so I decided to dig a bit more :slight_smile:

The root problem is one of these:

  1. The input BAM did not have a database assigned
  2. The input BAM has a different or mismatched database assigned
  3. The genome fasta was not originally indexed with the Data Manager data_manager_fetch_genome_dbkeys_all_fasta, or manually indexed to do the same (which involves more than just the fasta.fai index itself). All of this has to be done without any conflicts, such as a duplicated dbkey or missing indexes (including indexes that point to other indexes :melting_face:)

Docs: Galaxy Community Hub

The “first step” for all newly indexed genomes is very important.

  • Galaxy has a built-in list of dbkeys for known/common genomes to enable consistent labeling across all (most?) Galaxy installs. Those keys are indexed themselves for UI functions (Upload tool, assigning/changing database metadata).

    • If you create a new dbkey it is either added to that master index of all dbkeys available to all users (when using the “fetch” DM) or to just that specific user’s custom version of that master dbkeys index (when creating a custom genome build).
    • For both, the fasta index is also created since the actual sequences are now available.
  • Duplicated dbkeys lead to all sorts of issues across tools and functions and are tedious to fix. If this seems to be the problem and the Galaxy install is new, it is usually better to just start over with a fresh Galaxy base install. Then decide to use CVMFS, or your own local indexes, or both.

    • If a dbkey already exists, use that when fetching a new genome, or expect problems.
    • If a dbkey does not exist yet, you can create a new one for your local indexes, but it must be different from any that already exist. For brand new dbkeys, the DM will run a few more steps, including updating the master dbkeys index.
  • dbkey is the technical label in files/tables for what is displayed as the database metadata in the end-user portion of the UI.

  • A known dbkey is the same reserved “key:value” pairing that data providers like UCSC also use to label specific exact assemblies. (all of it – the dbkey, the fasta title lines, AND the sequences).

  • A dbkey is what enables direct connections to/from external sites or applications. Some of those external resources also support the creation/use of custom dbkeys. So, if a key that is an exact “match” between the applications is used, useful functions like dataset displays are possible – directly.
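
As a rough illustration of the duplicated-dbkey problem described in the list above, here is a minimal sketch that scans a tab-separated table for dbkeys that occur on more than one row. The column layout (dbkey in the second column), the example rows, and the paths are all assumptions made up for this example; check the actual layout of the table on your own server before relying on it.

```python
from collections import Counter


def find_duplicate_dbkeys(table_text, dbkey_column=1):
    """Return the dbkeys that appear on more than one row.

    Assumes a tab-separated table with the dbkey in `dbkey_column`
    (0-based). Blank lines and lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > dbkey_column:
            counts[fields[dbkey_column].strip()] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical table contents: two different fasta files share the dbkey "hg38".
example = "\n".join([
    "# value\tdbkey\tname\tpath",
    "hg38\thg38\tHuman (hg38) UCSC\t/data/hg38/hg38.fa",
    "hg38_ensembl\thg38\tHuman (hg38) Ensembl\t/data/hg38_ensembl/hg38.fa",
    "hg19\thg19\tHuman (hg19)\t/data/hg19/hg19.fa",
])

print(find_duplicate_dbkeys(example))
```

Any dbkey this reports is a candidate for the kind of conflict that is tedious to untangle later.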

So, the specific error is produced by Galaxy and is related to data indexes. Why the two wrappers perform differently, I don’t know. But these are the relevant technical items beyond the indexes.



Hi @prao123

Maybe upgrade FreeBayes to the latest version instead of some prior version?

Version 1.3.6 isn’t even hosted at some of the public servers. The most current (and recommended) version is 1.3.6+galaxy0. This is expected to be paired with the most current release of Galaxy, 22.05 (see the Releases page in the Galaxy Project 23.1.1.dev0 documentation).

All versions are stored in the ToolShed (Galaxy | Tool Shed) with links out to the development GitHub repository that tracks all the changes if you want to investigate more about exactly why 1.3.6 isn’t working. I don’t remember why and it may not matter if you can get the more current version installed, and it works.

Indexes can be tricky to produce directly. At a minimum, all distinct assemblies should have at least four core Data Managers applied, in order, then layer in tool-specific indexes as needed. I’m guessing that one of these steps was missed, or has a typo, or maybe an extra space, although the latter is much more common when NOT using Data Managers.

Did you know that indexes are available for local Galaxy servers through CVMFS? See: training-material/search?query=cvmfs

The error you had can come up when the database metadata assigned to the BAM inputs (by an upstream mapping tool) is not an exact match for the build key (also called the dbkey) used to create/label all of the related built-in indexes. You can either adjust your current indexes to be consistently named, fix typos, etc – or easier – link in the CVMFS indexes. These include the full suite of hg19 and hg38 indexes (across tools) plus everything else you see indexed at usegalaxy.* sites. All is hosted from a remote file system, and just slices of data are used when called by a tool.
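
To make the “exact match” requirement concrete, here is a small sketch of how a mismatch between a dataset's assigned database and the indexed dbkeys might be classified. The function name and the returned messages are hypothetical, and treating "?" as “no database assigned” is an assumption for this example; this is not how Galaxy itself performs the check.

```python
def diagnose_dbkey(assigned, indexed_dbkeys):
    """Classify how an assigned database label relates to the indexed dbkeys."""
    # Assumption for this sketch: None, an empty string, or "?" all mean
    # that no database was assigned to the dataset.
    if assigned is None or assigned.strip() in ("", "?"):
        return "no database assigned"
    if assigned in indexed_dbkeys:
        return "exact match"
    # Look for keys that differ only by surrounding whitespace or letter case.
    normalized = {key.strip().lower(): key for key in indexed_dbkeys}
    near = normalized.get(assigned.strip().lower())
    if near is not None:
        return f"near miss: {assigned!r} vs indexed {near!r}"
    return "no matching index"


indexed = ["hg19", "hg38"]
for label in ["hg38", "hg38 ", "Hg19", "?", "GRCh38"]:
    print(repr(label), "->", diagnose_dbkey(label, indexed))
```

A “near miss” (an extra space or different capitalization) looks the same to a person but is a different key to the software, which is why typos in manually built indexes cause this kind of error.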

Hope that helps!

Hello @jennaj,
Thank you for looking into the issue more deeply. My post should have been a bit more detailed. My FreeBayes version that was giving me issues is 1.3.6+galaxy0. I have since downgraded to version 1.3.1. My Galaxy build is 22.05.
Additionally, I created the index files replicating the steps outlined in this post by Jennifer Hillman Jackson:
https://biostar.usegalaxy.org/p/19371/
Briefly, I ran the

  • Create DBKey and Reference Genome fetching tool
  • SAM fasta index builder
  • picard index builder
  • 2bit index builder
  • and then finally HISAT2 index builder

making sure that the next index builder is applied only after the previous one has completed. The indexes were built on the Ensembl human genome fasta file (hg38) downloaded separately.

I believe this to be the root cause of the issue because even with the previous human genome build (hg19), the issue recurs.
I think the CVMFS may be the best route to go with.

Thank you for linking the tutorial.

The problem is probably with the format of that fasta. Specifically, the > lines don’t match what UCSC used, so sequence lengths cannot be found. The hg38 dbkey is already part of the “comes with Galaxy” master list of dbkeys (which includes length-per-identifier).

Either get that fasta

  • from UCSC with the DM tool (the original source of hg38) by starting over for that specific dbkey
  • or use CVMFS
  • or give the Ensembl version of the assembly a distinct dbkey

For any of these, you’ll need to strip out your existing indexes (most of that is trial and error) or start over with a fresh Galaxy install first.
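
The header mismatch described above can be checked before indexing. The sketch below pulls the sequence identifiers from the > lines of a fasta and reports any that are missing from a table of known lengths. The function names and the tiny example inputs are made up for illustration; the two lengths are the hg38 values for chr1 and chrM, and a real check would use the full length table for the dbkey.

```python
def fasta_names(fasta_text):
    """Return the identifier (first word after '>') of each fasta title line."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.append(words[0])
    return names


def names_without_lengths(fasta_text, known_lengths):
    """Return fasta identifiers that have no entry in the length table."""
    return [name for name in fasta_names(fasta_text) if name not in known_lengths]


# Length table keyed by UCSC-style names (only two entries, for brevity).
ucsc_lengths = {"chr1": 248956422, "chrM": 16569}

# Hypothetical, truncated inputs: Ensembl-style vs UCSC-style title lines.
ensembl_style = ">1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF\nACGT\n>MT dna:chromosome\nACGT\n"
ucsc_style = ">chr1\nACGT\n>chrM\nACGT\n"

print(names_without_lengths(ensembl_style, ucsc_lengths))
print(names_without_lengths(ucsc_style, ucsc_lengths))
```

If the first list is not empty, the fasta does not agree with the identifiers that the dbkey expects, which is the situation where giving the assembly a distinct dbkey (or switching to the UCSC source or CVMFS) makes sense.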

This is me :sweat_smile:

Thank you @jennaj ! for that insight. I will proceed with a fresh install of Galaxy and use CVMFS for the genome indices. I think that is the best route to solve this issue.