I want to map my reads with Bowtie 2 to a custom reference genome within the workflow editor. I uploaded a reference genome in .fasta format, and when I use Bowtie 2 as a stand-alone tool in the “Analyze Data” mode it all works fine. I select “use a built-in index” and can then select my reference genome from the drop-down menu below. Mapping works and reads are correctly aligned to the reference.
However, when I want to integrate the Bowtie 2 alignment into my workflow and choose the “use a built-in index” option, no drop-down menu with my reference fasta file appears.
I checked for the correct .fasta datatype and also tried “normalize fasta” as described under “Help->Support”. However, my drop-down menu never shows the respective .fasta genome in the workflow mode.
Welcome @DavidB !!
There are two ways to do this:
- Create a Custom Build. See below for the how-to.
- Set up the workflow’s tools to use a genome from the history, then select the fasta at runtime (it must be in the same history the workflow is launched from). This won’t work for all tools: some read an input’s genome name (the “dbkey”, assigned as the “database” attribute) instead of using a fasta from the history, which makes the first choice better.
Custom Builds:
A fasta Custom Genome can be upgraded to a Custom Build. Once the build is created, it is local to your account (only) and will appear at the top of the list of built-in genomes in menus, including those in workflows. Do this before running the workflow, since this process hasn’t been wrapped into a “tool” (yet; it probably could be).
Related FAQs:
- Preparing and using a Custom Reference Genome or Build >> https://galaxyproject.org/learn/custom-genomes/#custom-builds
- Mismatched Chromosome identifiers (and how to avoid them) >> worth reviewing. If there is a mismatch between inputs, all types of odd problems can result (actual job errors or, worse, scientifically problematic results that may not be obvious to detect). The solution almost always involves fixing the data so that it matches up, then starting completely over. No one enjoys finding out that’s the problem after the prep work has already been done once.
Tips:
Decide what to name the genome/database (the “dbkey”). Make sure it is unique on the server you are using, a single “word” (no spaces), and uses only alpha-numeric characters plus underscores. Start the name with a letter.
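If it helps, the naming rules above can be checked with a few lines of Python before you create the build. This is only a sketch of the rules as stated in this post: `is_valid_dbkey` is a made-up helper name, not part of Galaxy, and it cannot check whether the name is already taken on your server.

```python
import re

# Hypothetical helper (not a Galaxy function): encodes the naming tip above.
# A name must start with a letter and then contain only letters, digits,
# and underscores, so spaces and punctuation are rejected.
_DBKEY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_dbkey(name: str) -> bool:
    """Return True if `name` follows the dbkey naming tip."""
    return _DBKEY_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["myGenome_v1", "my genome", "1genome", "genome-v2"]:
        print(candidate, is_valid_dbkey(candidate))
```

Uniqueness still has to be checked by hand against the genome list on the server you are using.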
Run the fasta genome through the tool NormalizeFasta first, using the options to wrap sequence lines at 80 bases and to remove the description content from the “>” title line (everything after the first whitespace).
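To make it clearer what that normalization step changes in the file, here is a rough Python sketch of the same transformation. It is not the NormalizeFasta tool itself, just an illustration under the assumptions stated in the tip (wrap at 80, keep only the identifier on the title line); `normalize_fasta` is a hypothetical name.

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with trimmed title lines and re-wrapped sequence.

    Illustration only: mimics the two options mentioned above, not the
    actual NormalizeFasta implementation.
    """
    seq_parts = []

    def flush():
        # Join the buffered sequence pieces and re-wrap them at `width`.
        sequence = "".join(seq_parts)
        seq_parts.clear()
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            yield from flush()
            # Keep only the identifier: everything before the first whitespace.
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            seq_parts.append(line)
    yield from flush()


if __name__ == "__main__":
    example = [">chr1 Homo sapiens chromosome 1", "ACGT" * 30]
    for out_line in normalize_fasta(example):
        print(out_line)
```

In Galaxy you would still run the tool on the dataset; the sketch only shows why a title like `>chr1 Homo sapiens chromosome 1` ends up as `>chr1`.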
Double-check that your “chromosome” identifiers are distinct and match whatever other data inputs you may be using in the analysis (reference annotation, etc). It is very important that all inputs and jobs are based on the exact same reference genome build/format. The second FAQ linked above explains how to check.
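One quick way to do that double-check outside Galaxy is to compare the identifiers directly. The sketch below assumes the annotation is a tab-separated file whose first column is the chromosome name (as in GTF, GFF, or BED) and that comment lines start with `#`; the function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence identifiers: text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def annotation_ids(lines):
    """Collect the first tab-separated column, skipping blank and '#' lines."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_from_reference(fasta_lines, annotation_lines):
    """Return annotation identifiers that have no matching fasta record."""
    return sorted(annotation_ids(annotation_lines) - fasta_ids(fasta_lines))


if __name__ == "__main__":
    fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
    annotation = ["#comment", "chr1\tsrc\tgene\t1\t4", "2\tsrc\tgene\t1\t2"]
    print(missing_from_reference(fasta, annotation))
```

Anything this reports (for example `2` versus `chr2`) is a mismatch to fix before running the workflow.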
The same process applies for transcriptomes, exomes, or really any fasta file that you want to use. Avoid fasta files with more than a few hundred records (highly fragmented assemblies); filter by sequence length or keep only the primary chromosomes first, as needed. Too many records will cause problems with tools, usually for memory reasons during the final “coordinate-sort” step performed by default when result bam datasets are generated.
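For reference, filtering by sequence length amounts to something like the following Python sketch. The helper names and the cutoff are placeholders for illustration; pick a minimum length that suits your assembly.

```python
def read_fasta(lines):
    """Yield (title, sequence) pairs from FASTA-formatted lines."""
    title, parts = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(parts)
            title, parts = line[1:], []
        elif title is not None:
            parts.append(line)
    if title is not None:
        yield title, "".join(parts)


def filter_by_length(lines, min_length):
    """Keep only records whose sequence has at least `min_length` bases."""
    return [(t, s) for t, s in read_fasta(lines) if len(s) >= min_length]


if __name__ == "__main__":
    example = [">chr1", "ACGTACGT", "ACGT", ">scaffold_9", "ACG"]
    # The cutoff of 10 is arbitrary and only for this toy example.
    for title, sequence in filter_by_length(example, 10):
        print(title, len(sequence))
```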
Hope that gives some options!