How can I incorporate an annotated bacterial genome for RNA sequencing analysis using galaxy?
Hi @Luisa_Nieto,
could you provide some additional details about that question?
Regards
Yes, I am using the BWA tool, and the reference genome that I need is not included. I supplied the genome in fasta format and chose the auto option for the algorithm for constructing the BWT index (one of the suggested options). However, I am having difficulties with the featureCounts tool downstream. Therefore, I would like to know whether it is possible to follow recommendation #1 (to see if I can improve the downstream analysis):
" 1. Contact galaxy team using Help->Support link at the top of the interface and let us know that an index needs to be added"
Hi @Luisa_Nieto
Thanks for explaining
Use a custom genome fasta and optionally promote that to a custom build to create “database” metadata that can be assigned to datasets. Some tools use the fasta directly and some tools require/use the “database” metadata. Tip: avoid assigning an existing “database” to any data that is associated with a custom genome fasta. Instead, create your own custom database key to avoid content conflicts that can be difficult to interpret.
The general flow is usually:
- Upload the genome `fasta` to Galaxy
- Upload the genome annotation to Galaxy (`GTF` is usually best)
- Standardize the format of both
  - make sure the “chromosome” IDs are an exact match between the fasta and GTF
  - remove any extra content from the fasta `>` title line
  - remove any `#` header lines from the `GTF` (and `GFF3` would have at least one)
- Create the custom build “database” for the standardized fasta
- Done and ready to start an analysis project
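As a rough illustration of the "standardize" step, here is a minimal Python sketch that could be run locally before uploading. It is not a Galaxy tool or API; the function names are made up for this example, and the sample records are illustrative only. It assumes the sequence ID is the first whitespace-delimited word on each `>` title line.

```python
def simplify_fasta_titles(lines):
    """Keep only the first word of each '>' title line; drop blank lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            # Assumption: the ID is the first whitespace-delimited word.
            out.append(">" + (parts[0] if parts else ""))
        elif line.strip():
            out.append(line)
    return out


def strip_gtf_headers(lines):
    """Drop '#' header/comment lines and blank lines from GTF text."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Illustrative input only (hypothetical file contents).
example_fasta = [
    ">NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome\n",
    "AGCTTTTCATTCTGACTGCA\n",
]
example_gtf = [
    "#!genome-build ASM584v2\n",
    'NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tgene_id "b0001";\n',
]

clean_fasta = simplify_fasta_titles(example_fasta)
clean_gtf = strip_gtf_headers(example_gtf)
```

The same cleanup can be done inside Galaxy with text manipulation tools, which is what the FAQs below walk through.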
These FAQs explain the “how to”
- reference-genomes
- adding-a-custom-database-build-dbkey
- working-with-fasta-datasets
- working-with-reference-annotation
You can also search this forum with keywords to find prior Q&A about troubleshooting problems. I also added a few tags that do the same. You’ll notice that most of those involve format problems and link back to those FAQs or variations of them. Getting both of these reference files prepped and ready to use, at the very start and before starting the actual analysis, will make life happier.
We usually only index reference genomes and not reference annotation since the latter changes so frequently. The problem is likely with the current format of your fasta or GTF or both. The help above usually solves those kinds of errors. The custom genome function works great for bacterial-sized (or smaller) genomes, and sometimes is better than a native index. Why? It is easier to compare the IDs when both datasets are in a history (and not behind an index), and it is something you can do immediately without waiting for the public sites to be updated.
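To make the ID comparison concrete, here is a small Python sketch of the kind of check that is possible when both files are at hand. It is only a local illustration under stated assumptions, not part of Galaxy: the helper names are hypothetical, the fasta ID is taken to be the first word after `>`, and the GTF ID is taken to be the first tab-separated column.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first word after '>') from fasta text."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].split()
    }


def gtf_ids(lines):
    """Collect IDs from the first column of GTF text, skipping '#' lines."""
    return {
        line.split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def id_mismatches(fasta_lines, gtf_lines):
    """Return (IDs only in the GTF, IDs only in the fasta), each sorted."""
    in_fasta = fasta_ids(fasta_lines)
    in_gtf = gtf_ids(gtf_lines)
    return sorted(in_gtf - in_fasta), sorted(in_fasta - in_gtf)


# Illustrative mismatch: the fasta calls the sequence "chr",
# while the annotation uses an accession-style ID.
only_gtf, only_fasta = id_mismatches(
    [">chr\n", "ACGTACGT\n"],
    ['NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tgene_id "b0001";\n'],
)
```

If either list is non-empty, counting tools are likely to report few or no assigned reads, because the annotation and the mapped reads refer to differently named sequences.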
So, please give that a try, and if the errors persist, you can post back a shared history link that contains both reference datasets and the error data (inputs and outputs undeleted) and we can try to help more. That can be public as a reply, or ask for a private message chat to be started up and you can share the history URL in that.
Note: Larger genomes (mammalian, some plants) do benefit from server-side indexes – and we would need a link to the data source for those. The associated annotation data would still be uploaded/standardized independently in the working history by you. Any annotation data that can be indexed on the server is probably already available (rare, and only for a few tools).
The tool choice could also be a problem – but not necessarily since you are working with a bacterial genome.
General usage:
- `BWA` is for mapping unspliced reads (WGS, ChIP-seq, others)
- `HISAT2` and `Bowtie2` are for mapping spliced reads (RNA-Seq)
- `Featurecounts` is for counting up the abundance of transcriptome features from `RNA-seq` experiments (mostly).
For examples, including workflows, please see our tutorials here