Hello and good day.
I am teaching students using the Galaxy server tutorial “Reference-based RNA-Seq data analysis”. One of my students intends to analyze samples from the plant species Glycine max. Since the reference genome for this species is not available in Galaxy, we downloaded it along with the GFF3 file from the Ensembl Plants website and performed the analysis. However, in the featureCounts results, the count files have a question mark in the “database” section, and the species name is not mentioned.
I wanted to check whether this is a problem, or whether it is simply because the reference genome was not one of the built-in genomes available in Galaxy, in which case it is okay that Galaxy didn’t recognize it. Should we have entered the species name manually while uploading the reference genome?
The ? indicates that the dataset file is not yet assigned to a specific named fasta index. This cannot be assigned at runtime when mapping, but it can be assigned afterward. The assignment can be made for individual datasets or for batches of datasets in a collection folder.
Then, for the part about entering the species name manually:
There is an extra optional step that can be done to fully set up a Custom reference genome. We call this step a Custom Build.
How Custom Builds are assigned and used
Uploading a fasta file into your history and selecting it on tool forms is enough for some analyses. You’ll be able to visualize that single file with an automatically generated, single-use fasta index. We want this to be fast and easy!
But if you plan to visualize multiple result files together, for example, in a local IGV, each dataset file must be assigned to the same named fasta index in Galaxy. Then, in IGV, that same reference genome is set up as a custom index. The applications use the common label to load data into the same genome assembly coordinate system.
You might see this called a database or dbkey across applications, but they are all the same thing: the fasta file itself and a fasta index.
The steps below are for when you want to show your students how to do this! It really is very powerful, and I think it helps to understand what is going on technically. If the students ever work on the command line in the future, all of this manipulation in Galaxy will be directly transferable: getting a reference genome set up in multiple tools so that those tools can communicate.
(optional) Place a copy of the genome fasta and any reference gtf/gff3 into a dedicated history. Name the history! This makes it easier to remember what you used for your custom dbkeys, especially if you want to reuse it later on.
Create the Custom Build in Galaxy
Assign your new database dbkey to datasets
Create a genome.fasta.fai dataset in Galaxy (from the fasta dataset in your history → pencil icon, convert).
Download the genome.fasta and genome.fasta.fai datasets to your computer.
Use these two files to create the custom database in a local IGV. The option in IGV is under Genomes → Load Genome from File. Be sure to use the exact same dbkey name label as you used in Galaxy!
Then, back in Galaxy, to add a dataset to a visualization in IGV, click on the visualize icon for the dataset, and choose local IGV. The dataset will transfer over, hosted from Galaxy (no direct download step).
There are many examples of this out in the wild for people working on the command line, and a browser search with “custom genome IGV” will locate tutorials. But the instructions above should be enough when starting from Galaxy, and they don’t involve actually downloading all the analysis data files (just the genome, and just once).
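If it helps to show students what the genome.fasta.fai file from the steps above actually contains, here is a minimal Python sketch that computes the standard five fasta index columns (sequence name, length, byte offset of the first base, bases per line, bytes per line) from fasta content. It is only an illustration of the file layout, not how Galaxy or IGV builds the index internally. The function name build_fai and the tiny chr1/chr2 example sequences are made up for this post, and the sketch assumes a well-formed fasta where every line of a sequence except the last has the same width.

```python
import io


def build_fai(fasta_bytes):
    """Return .fai-style records: (name, length, offset, linebases, linewidth).

    Hypothetical helper for illustration only; assumes well-formed fasta input.
    """
    records = []
    name = None
    length = offset = linebases = linewidth = 0
    pos = 0  # byte position where the current line starts
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            # A new header closes the previous sequence, if there was one.
            if name is not None:
                records.append((name, length, offset, linebases, linewidth))
            # The sequence name is the header text up to the first whitespace.
            name = line[1:].split()[0].decode()
            length = linebases = linewidth = 0
            # Sequence data starts right after the header line.
            offset = pos + len(line)
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            # The first non-empty sequence line defines the line geometry.
            if linebases == 0 and bases > 0:
                linebases, linewidth = bases, len(line)
            length += bases
        pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


if __name__ == "__main__":
    # Made-up two-sequence fasta, just to show the columns.
    example = b">chr1 description\nACGTACGT\nACGT\n>chr2\nGGCC\n"
    for record in build_fai(example):
        # A .fai file stores these five values tab-separated, one line per sequence.
        print("\t".join(str(field) for field in record))
```

In practice you would let Galaxy create the index (the convert option mentioned above) or run samtools faidx on the command line. The point of the sketch is that the index only records sequence names and coordinates, which is why the same fasta and the same label have to be used in every tool.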