RNA Star does not recognize GENCODE gtf files

Hi there:
Dear colleagues, I have used RNA Star many times over the last year. Within the last week I have done several alignments using the mouse genome from Ensembl (http://www.ensembl.org/Mus_musculus/Info/Index) and the mouse annotation from GENCODE (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz). I have always added both files to my history before performing the alignment. During RNA Star parameter selection, the gtf files were always recognized. But since yesterday, this is not possible. The gtf files are present in my history, but when I try to select them to create a temporary index, in the “Gene model (gff3,gtf) file for splice junctions” section, they are not selectable and, therefore, cannot be loaded.
What has happened? Have you modified the parameter selection section? What am I doing wrong?

Welcome, @Francisco_Hernandez1

For this immediate problem, the annotation dataset probably has a “datatype” of gff assigned. This is due to the extra header lines in the data from the source. Headers are not in the specification for the gtf format. The headers need to be removed and the datatype gtf assigned (or better, use “redetect” to ensure the format is an actual match) under pencil icon > Edit Attributes > Datatypes.
How to do this is covered in this prior Q&A: RNA-STAR and hg38 GTF reference annotation. The same help applies to gtf annotation from other sources that have a header and/or internal extra lines that are out of specification.
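If you prefer to clean the file on your own computer before uploading it, the header removal can be sketched in a few lines of Python. This is only a minimal sketch, not the method from the linked Q&A: the function name `strip_gtf_headers` is made up for illustration, and it assumes that every out-of-specification line either starts with "#" or is blank.

```python
import gzip


def strip_gtf_headers(src_path, dst_path):
    """Copy a GTF file without '#' header lines or blank lines.

    Assumption: all out-of-specification lines start with '#' or are empty.
    Returns the number of lines that were dropped.
    """
    # Read gzipped downloads directly; plain text files are opened normally.
    opener = gzip.open if src_path.endswith(".gz") else open
    removed = 0
    with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith("#") or not line.strip():
                removed += 1  # header, comment, or blank line
            else:
                dst.write(line)  # data line, kept unchanged
    return removed
```

After uploading the cleaned file, the datatype still needs to be checked (or redetected) as described above.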

A few other problems could be going on:

1 – The mouse genome assembly from Ensembl has chromosome names that differ from some other sources. Is there a reason why you are not using the gtf also available from this same source? That would ensure a match.

2 – The mouse annotation from Gencode has chromosome names that are a match for UCSC’s genome assembly mm10. The mm10 build is indexed natively at the Galaxy EU https://usegalaxy.eu server for RNA Star and most other tools. Plus, if you map against the pre-existing mm10 genome build’s index, your mapping jobs will run faster and be less likely to fail for resources. RNA Star is a memory-intensive tool. You can still incorporate annotation.

3 – If there is a mismatch between identifiers in the inputs to any tool, expect problems. In this case, if I am understanding correctly, the reference fasta (assembly) and the reference gtf (annotation) are not a match. That means the annotation is not being incorporated into the analysis correctly. A chromosome naming mismatch wouldn’t necessarily produce an error (it depends on the tool), but rather results that do not actually make use of the annotation. That may not be obvious/easy to detect, especially at the mapping step.

4 – If you still plan on using the Ensembl build, make sure your fasta is formatted correctly. Specifically, description content in the “>” title line is a problem for Custom genomes/builds. Also, promoting the Custom genome to a Custom build might be needed if you use a tool that requires a “database” assignment. Definitely don’t assign the “mm10” database to a fasta or other dataset that is based on non-mm10 formatted data, or expect problems. This usually won’t present as an issue during mapping, but it will with downstream tools/steps, requiring you to start over or to do some manipulations, when possible, with a tool like: Replace column by values which are defined in a convert file.

5 – When loading data with the Upload tool, using “autodetect” for the datatype is almost always the best choice, and definitely the best choice for the most common datatypes.

6 – Avoid gff3 annotation format when possible. Not all tools work with it – and you want to be sure to use the same genome and annotation build/version/data for all steps in an analysis path.

These FAQs cover more details for the above: https://galaxyproject.org/support/
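For point 4, here is a rough sketch of what “no description content in the title line” means in practice, in case you want to fix the fasta locally before upload. The function name `simplify_fasta_titles` is hypothetical, and the sketch assumes the identifier is everything between “>” and the first whitespace.

```python
def simplify_fasta_titles(src_path, dst_path):
    """Rewrite FASTA title lines so only the identifier remains.

    Assumption: the identifier is the text between '>' and the first
    whitespace; everything after it is description and is dropped.
    Sequence lines are copied unchanged.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                identifier = fields[0] if fields else ""
                dst.write(">" + identifier + "\n")
            else:
                dst.write(line)
```

As an illustration, a title line such as `>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF` would become `>1`.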

Hi Jenna.
Thanks for your reply. It seems that I did not explain the problem properly. I have always worked with gtf files, neither gff nor gff3. I have used them with RNA Star previously, either with the natively indexed mm10 build at Galaxy EU or with the Ensembl build in my history, without any problem. But since Sunday there is no way to select any gtf file from my history to be used with RNA Star. The files are there, in my history, but they are not selectable in the “Gene model (gff3,gtf) file for splice junctions” step.
I attach a picture to illustrate the problem. Green arrows indicate the gtf files in my history. The red arrow indicates where the problem is. Although there are several gtf files in the history, Galaxy does not recognize them (No gff3 or gtf dataset available) and, therefore, the alignment cannot be accomplished.

Right, and this is probably because the assigned datatype is gff and not gff3 or gtf.

Expand the dataset to see what datatype was assigned during Upload. Then make format adjustments as needed and “redetect” the datatype. You’ll be expecting gtf to be assigned once the re-formatting is done correctly for these data. The name of the dataset is just a label. The assigned metadata is what matters, and it must be a match for the actual data content: How do I find, adjust, and/or correct metadata?

The Gencode annotation will be a mismatch for Ensembl-based genomes. Review the chromosome names and you’ll see the difference. Tools will not necessarily fail, but the result content will be “off” scientifically.
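One way to review the chromosome names outside of Galaxy is to list the names used in each file and compare the two sets. The sketch below is only illustrative: the helper names (`fasta_names`, `gtf_names`, `compare_names`) are made up, and it assumes a plain-text fasta and a tab-separated gtf whose first column is the sequence name.

```python
def fasta_names(path):
    """Collect sequence identifiers (first word after '>') from a FASTA file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_names(path):
    """Collect sequence names (column 1) from a GTF file, skipping '#' and blank lines."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def compare_names(fasta_path, gtf_path):
    """Report which sequence names are shared and which appear in only one file."""
    genome = fasta_names(fasta_path)
    annotation = gtf_names(gtf_path)
    return {
        "shared": sorted(genome & annotation),
        "only_in_fasta": sorted(genome - annotation),
        "only_in_gtf": sorted(annotation - genome),
    }
```

With an Ensembl-style fasta (names like 1, MT) and a Gencode gtf (names like chr1, chrM), the "shared" list comes back empty, which is the mismatch described above.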