Welcome, @Francisco_Hernandez1
For this immediate problem, the annotation dataset probably has a “datatype” of gff assigned. This is due to the extra header lines in the data from the source. Headers are not in the specification for the gtf format. The headers need to be removed and the datatype gtf assigned (or better, use “redetected” to confirm the format is an actual match) under pencil icon > Edit Attributes > Datatypes.
How to do this is covered in this prior Q&A: RNA-STAR and hg38 GTF reference annotation. That same help can be applied to gtf annotation from other sources that have a header and/or internal extra lines that are out of specification.
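If you would rather clean the file before uploading it, the header removal can be sketched in a few lines of Python. This is only an illustration, not the method from the linked Q&A: the function name `strip_gtf_headers` and the sample lines are made up, and it assumes that headers start with "#" and that any line without exactly 9 tab-separated columns is out of specification and can be dropped.

```python
def strip_gtf_headers(lines):
    """Yield only lines that look like 9-column GTF records.

    Assumption: header/comment lines start with '#', and any other line
    without exactly 9 tab-separated fields is an out-of-spec extra line.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header/comment line
        if len(line.rstrip("\n").split("\t")) != 9:
            continue  # internal extra line that is not a GTF record
        yield line


# Made-up sample input: two header lines, one record, one blank line.
sample = [
    "#!genome-build EXAMPLE\n",
    "#!genome-version EXAMPLE\n",
    '1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id "GENE1";\n',
    "\n",
]

cleaned = list(strip_gtf_headers(sample))
print("".join(cleaned), end="")
```

For a real file, pass an open file handle instead of the sample list and write the kept lines to a new file. After uploading the result, the datatype should still be checked as described above.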
A few other problems could be going on:
1 – The mouse genome assembly from Ensembl has chromosome names that differ from some other sources. Is there a reason why you are not using the gtf that is also available from this same source? That would ensure a match.
2 – The mouse annotation from Gencode has chromosome names that match UCSC’s genome assembly mm10. The mm10 build is indexed natively at the Galaxy EU server (https://usegalaxy.eu) for RNA Star and most other tools. Plus, if you map against the pre-existing mm10 genome build’s index, your mapping jobs will run faster and be less likely to fail for resource reasons. RNA Star is a memory-intensive tool. You can still incorporate annotation.
3 – If there is a mismatch between identifiers in the inputs to any tool, expect problems. In this case, if I am understanding correctly, the reference fasta (assembly) and the reference gtf (annotation) are not a match. That means the annotation is not being incorporated into the analysis correctly. A chromosome naming mismatch won’t necessarily produce an error (it depends on the tool). Instead, you may get results that do not actually make use of the annotation. That may not be obvious or easy to detect, especially at the mapping step.
4 – If you still plan on using the Ensembl build, make sure your fasta is formatted correctly. Specifically, description content in the “>” title line is a problem for Custom genomes/builds. Also, promoting the Custom genome to a Custom build might be needed if you use a tool that requires a “database” assignment. Definitely don’t assign the “mm10” database to a fasta or other dataset that is based on non-mm10 formatted data, or expect problems. This usually won’t present as an issue during mapping, but it will with downstream tools/steps, requiring you to start over or, when possible, to do some manipulations with a tool like: Replace column by values which are defined in a convert file.
5 – When loading data with the Upload tool, using “autodetect” for the datatype is almost always the best choice, and definitely the best choice for the most common datatypes.
6 – Avoid gff3 annotation format when possible. Not all tools work with it, and you want to be sure to use the same genome and annotation build/version/data for all steps in an analysis path.
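Since a naming mismatch (item 3) may not raise an error, it can help to compare the sequence names in the fasta and the gtf directly before mapping. The following is a minimal sketch, not a Galaxy tool: the helper names `fasta_names` and `gtf_names` and the tiny sample data are made up for illustration. The samples mimic the usual difference between Ensembl-style names ("1", "MT") and UCSC-style names ("chr1", "chrM").

```python
def fasta_names(lines):
    """Sequence identifiers: the first word after '>' on each title line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_names(lines):
    """Values of column 1 (sequence name) for non-comment, non-blank lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


# Made-up sample data: Ensembl-style fasta names, UCSC-style gtf names.
fasta = [">1 some description\n", "ACGT\n", ">MT some description\n", "ACGT\n"]
gtf = [
    'chr1\tsource\tgene\t1\t4\t.\t+\t.\tgene_id "GENE1";\n',
    'chrM\tsource\tgene\t1\t4\t.\t+\t.\tgene_id "GENE2";\n',
]

shared = fasta_names(fasta) & gtf_names(gtf)
only_in_gtf = gtf_names(gtf) - fasta_names(fasta)

print("names in both files:", sorted(shared))
print("names only in the gtf:", sorted(only_in_gtf))
```

If the set of shared names is empty, or much smaller than expected, the annotation will not line up with the assembly, and it is better to switch to a matched pair of files than to continue.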
These FAQs cover more details for the above: https://galaxyproject.org/support/
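For item 4, the description content on the “>” title lines can be trimmed before upload. This is a rough sketch under the assumption that the identifier is the first word after “>” and everything after the first whitespace is description. The function name `trim_fasta_titles` and the sample lines are made up for illustration.

```python
def trim_fasta_titles(lines):
    """Keep only the identifier (first word) on each '>' title line.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            yield ">" + (words[0] if words else "") + "\n"
        else:
            yield line


# Made-up sample: a title line with description content, then sequence.
sample_fasta = [">1 some description text\n", "ACGTACGT\n"]

trimmed = list(trim_fasta_titles(sample_fasta))
print("".join(trimmed), end="")
```

This only shortens the title lines. It does not rename chromosomes, so the identifiers still need to match the annotation being used.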