Reference genomes at public Galaxy servers
Human example → Homo sapiens GRCh38 hg38
The version of hg38 hosted as a native built-in index at UseGalaxy servers was sourced from UCSC.
- Genome and Table Browser Dec. 2013 (GRCh38/hg38)
- UCSC Genome Browser Downloads #human
Versions of GRCh38/hg38 from other data providers can differ slightly.
- Alternative chromosome naming → The same underlying genomic assembly with potentially different data labels. This is the most common situation, and our FAQs explain what to consider plus what can be done. Example: tabular/GTF/BED data with alternative naming labels can be adjusted (tool: Replace column).
- Patch releases → A patch release is considered a distinct genome assembly for most use cases. Annotation for a patch assembly might work with the primary assembly, since some tools simply ignore annotation on the extra contigs, but don’t assign the hg38 database to that GTF. Getting errors? Find a different file that is an actual match.
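The relabeling described in the first bullet can also be sketched outside Galaxy. This is a minimal Python illustration, not the Replace column tool itself: the function name `rename_chromosomes` and the `EXAMPLE_MAPPING` table are hypothetical, and a real mapping has to cover every identifier in your file.

```python
def rename_chromosomes(lines, mapping):
    """Yield tab-delimited lines with the first column relabeled via mapping.

    Comment lines (starting with '#') and blank lines pass through unchanged.
    Identifiers missing from the mapping are left as they are.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        fields = line.split("\t")
        fields[0] = mapping.get(fields[0], fields[0])
        yield "\t".join(fields)


# Hypothetical mapping from unprefixed labels to UCSC-style labels (assumption).
EXAMPLE_MAPPING = {"1": "chr1", "2": "chr2", "X": "chrX", "MT": "chrM"}

example = [
    "#!genome-build GRCh38\n",
    "1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id \"g1\";\n",
    "MT\tsrc\tgene\t5\t50\t.\t-\t.\tgene_id \"g2\";\n",
]
renamed = list(rename_chromosomes(example, EXAMPLE_MAPPING))
```

Only the labels change; the coordinates are untouched, which is why this only makes sense when both files describe the same underlying assembly.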
Choice 1: native reference genome, native database key, and user supplied reference annotation
UCSC hosts GTF reference annotation in their Downloads area from a few different gene annotation tracks based on their hg38 assembly. These would work with tools without extra manipulations (copy URL + Upload with auto detect defaults == ready to use). This would allow you to assign the hg38 database metadata key to any datasets based on the same assembly version (basepairs) and labeling (identifiers). This would also allow you to link out to more display applications like IGV, UCSC, and others “automatically”.
Other data sources are possible … but might need minor labeling or format adjustments. Gencode hosts annotations that are based on this same hg38 base assembly. Removing the # header lines is a good idea.
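Removing the # header lines is a simple line filter. A minimal Python sketch of the idea follows; the function name `strip_header_lines` and the sample lines are made up for illustration, and a text-filtering tool inside Galaxy would do the same job.

```python
def strip_header_lines(lines):
    """Return the annotation lines without '#' comment/header lines."""
    return [line for line in lines if not line.startswith("#")]


# Made-up sample content, shaped like a GTF with leading header lines.
gtf_with_headers = [
    "##description: example header\n",
    "##provider: EXAMPLE\n",
    "chr1\tEXAMPLE\tgene\t100\t200\t.\t+\t.\tgene_id \"g1\";\n",
]
cleaned = strip_header_lines(gtf_with_headers)
```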
Choice 2: custom reference genome, custom build database key, and user supplied reference annotation
You can load up the reference genome (fasta) for the assembly version you want to use instead. Set it up as a custom genome, and use a paired annotation with it. This could include creating a “custom build” database metadata key for the genome assembly. External display applications often have their own custom genome functions. Set up your genome in Galaxy, set it up in the other application … then moving data between the two will work. The data preparation is similar for custom transcriptome reference assemblies.
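Before pairing an annotation with a custom genome, it can help to confirm that both use the same sequence identifiers. The Python sketch below is one way to check this, under the assumption that the identifier is the first whitespace-delimited word on each fasta title line and the first column of the annotation. The function names and sample data are hypothetical.

```python
def fasta_ids(lines):
    """Collect the identifier (first word after '>') from each fasta title line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def annotation_ids(lines):
    """Collect first-column identifiers from tab-delimited annotation lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


# Made-up sample data for illustration only.
fasta = [">chr1\n", "ACGT\n", ">chrM\n", "GGCC\n"]
gtf = [
    "chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id \"g1\";\n",
    "chrUn_example\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id \"g2\";\n",
]
unmatched = annotation_ids(gtf) - fasta_ids(fasta)
```

Any identifier left in `unmatched` is annotated on a sequence that the genome does not contain, which is the kind of mismatch that shows up later as odd tool errors.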
References
Start with these Galaxy Training Network (GTN) resources
- Custom Reference Genome/Build → FAQ: How to use Custom Reference Genomes?
- Reference Annotation overview → FAQ: Working with GFF GFT GTF2 GFF3 reference annotation
- Reference Data practicals → https://training.galaxyproject.org/training-material/faqs/galaxy/analysis_differential_expression_help.html. Help in here has a bit more context about what to pay attention to when getting reference datasets prepared. See this even when differential expression is not your exact analysis domain.
- Reference Data mismatches → FAQ: Mismatched Chromosome identifiers and how to avoid them. Data mismatches are similar to bad reagents in a wet lab experiment: all sorts of odd problems can come up! Pick new file “reagents” or use your data scientist skills to fix things up → https://training.galaxyproject.org/training-material/topics/introduction/tutorials/data-manipulation-olympics/tutorial.html.
- The wider bioinformatics community has many comments on this topic too! A quick internet search will find discussion, and you could start with these examples.
Q&A
How can I use patch assemblies?
We won’t be hosting patch versions of assemblies as native indexes, or at least not anytime soon. Perhaps in the future. But please know that anything you can do with an index we happen to host, you can probably also do with the custom functions.
When will T2T assemblies be available?
The human (Homo sapiens) T2T assembly is natively indexed for some tools, and UCSC has converted the chromosome coordinates for many data tracks over to the new assembly.
These are mostly the same right now:
- UCSC → Jan. 2022 (T2T-CHM13 v2.0/hs1)
- UseGalaxy → CHM13_T2T_v2.0
Using this reference genome should be approached with caution and knowledge about potential analysis impacts. The nature of the assembly exposes repetitive and other previously unexposed genomic regions (telomeres). This will lead to scientific difficulties with many existing tools and protocols.
You are free to explore. Should problems come up, including what may seem to be processing errors but are actually scientific data issues, switching to hg38 is the recommended solution. As the scientific community develops T2T specific analysis tools and/or parameters over time, Galaxy will incorporate those resources.
FAQ: What information should I include when reporting a problem?
Any persistent problems can be reported in a new question for community help. Be sure to provide enough context so others can review the situation exactly and quickly offer advice.
Consider sharing your history (https://training.galaxyproject.org/training-material/faqs/galaxy/histories_sharing.html) or posting content from the Job Information view as described in https://training.galaxyproject.org/training-material/faqs/galaxy/analysis_troubleshooting.html.