UCSC Reference Genome and GTF: Fatal Error, no valid exons in the GTF file

Good morning!
I am running RNA STAR and featureCounts on the species Camponotus floridanus. I followed the Galaxy training tutorial for RNA-seq analysis of D. melanogaster and had no issues. In that tutorial, both the reference genome and the GTF were provided to us. We are told that if we need these files for a different species, we can use the UCSC tool to import the GTF and the genome FASTA into Galaxy.

I have begun the RNA-seq analysis for C. floridanus (C.flo). After finding its genome in NCBI, I was able to access it in the UCSC tool and import both the GTF file and the reference genome (FASTA) into Galaxy. Here is what the corresponding files look like:

For the fasta Genome -

For the GTF -

When I run RNAstar using the ‘use reference genome from history and create temporary index’ then choose the c.flo fasta file (imported from UCSC) and select build index with gene model, choosing the GTF file (from UCSC), it shows this error :

I am unsure how to proceed. I repeated this method using d.mel downloads from UCSC and it produced the same error, so I know that it is not due to my species of interest.

  • Am I performing the RNAstar wrong?
  • Is there actually a difference in the naming of my chromosomes between files, and if so how do I fix it? I have scoured the internet and Galaxy trying to find ways to fix it but have been running into dead end after dead end for hours on end.
  • For the section ‘use reference genome from history and create temporary index’, am I supposed to use a different file than the one downloaded from UCSC, or am I supposed to manipulate it so that it can be used?
    I would really appreciate some advice, as I am able to do everything except this step. I have been able to run RNA STAR with just the reference genome, but when I proceed to featureCounts, it runs but reports zero reads for all my genes and a summary of 0% aligned reads:

    Please help!
    It would be very much appreciated :slight_smile:

Hi @drcottoncandy

Thank you for sharing so many details! Very helpful.

The immediate problem is likely the fasta formatting. Notice how the > title lines include description content after the identifier. STAR is really picky about the format and cannot isolate the chromosome identifier, which means it cannot “match it up” with the same chromosome identifier in the GTF data. No lines are matched, so no exon lines are found.
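If you want to check this outside of Galaxy, here is a minimal Python sketch that compares the sequence names in a fasta file with the names in the first column of a GTF. The function names and the tiny example data are made up for illustration; with real files you would pass open file handles instead of the `io.StringIO` objects.

```python
import io

def fasta_ids(handle, first_token_only=False):
    """Collect sequence names from FASTA title lines (the text after '>')."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            title = line[1:].strip()
            if first_token_only:
                # Keep only the identifier, dropping any description text.
                parts = title.split(None, 1)
                title = parts[0] if parts else ""
            ids.add(title)
    return ids

def gtf_seqnames(handle):
    """Collect the sequence names from column 1 of a GTF file."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names

# Made-up example data, NOT taken from the files in this thread.
fasta_text = (
    ">chr1 Example species chromosome 1\nACGT\n"
    ">chr2 Example species chromosome 2\nGGCC\n"
)
gtf_text = 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'

full_titles = fasta_ids(io.StringIO(fasta_text))
first_tokens = fasta_ids(io.StringIO(fasta_text), first_token_only=True)
gtf_names = gtf_seqnames(io.StringIO(gtf_text))

shared_full = full_titles & gtf_names        # names shared when titles keep descriptions
shared_trimmed = first_tokens & gtf_names    # names shared after trimming descriptions
print("Shared names (full titles):", sorted(shared_full))
print("Shared names (identifier only):", sorted(shared_trimmed))
print("GTF names missing from fasta:", sorted(gtf_names - first_tokens))
```

If the GTF names only show up once the descriptions are trimmed, or not at all, that points to a naming mismatch between the two files.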

Try this

  • Run the tool NormalizeFasta on the custom genome fasta, being sure to check the box to remove everything after the first whitespace on the > title lines.
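For reference, the title-line clean-up described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Galaxy tool is implemented, and it does not re-wrap sequence lines. The function name and the example records are made up.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with everything after the first whitespace
    removed from '>' title lines. Sequence lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            yield ">" + identifier + "\n"
        else:
            yield line

# Made-up example record, NOT taken from the files in this thread.
original = [">chr1 Example species chromosome 1\n", "ACGT\n", "GGCC\n"]
cleaned = list(strip_fasta_descriptions(original))
print("".join(cleaned), end="")
```

With real data you would iterate over an open input file and write the yielded lines to a new file, then use that new file as the custom genome.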

Optional

  • I see a database assignment applied to your datasets. Database keys are fasta indexes.
  • Did you create your own custom database key already? If so, you will need to delete the old one and recreate it with the normalized fasta file.
  • In general, it is not a good idea to assign a server database key to custom data since there could be small differences (mismatched fasta indexes).
  • More details FAQ: How to use Custom Reference Genomes?

There are some important details I can’t see yet – the GTF format in the other columns specifically. The UCSC Table Browser will not output all of the data values that are sometimes needed. Instead, you will want the GTF from their downloads area (if available, sometimes you can request this, too) or from NCBI. This guide includes the details: FAQ: Extended Help for Differential Expression Analysis Tools
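A quick way to look at those other columns yourself is to count the feature types in column 3 and check the attributes in column 9. The sketch below is an assumption-laden example: it treats gene_id and transcript_id as the attributes to look for on exon lines, since those are the usual defaults, but the exact requirements depend on the tool and its settings. The function name and example lines are made up.

```python
from collections import Counter
import io

def summarize_gtf(handle):
    """Count feature types (column 3) and count exon lines that lack
    a gene_id or transcript_id attribute (column 9)."""
    feature_counts = Counter()
    exons_missing_ids = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            # GTF lines are expected to have nine tab-separated columns.
            feature_counts["<malformed>"] += 1
            continue
        feature, attributes = fields[2], fields[8]
        feature_counts[feature] += 1
        has_ids = "gene_id" in attributes and "transcript_id" in attributes
        if feature == "exon" and not has_ids:
            exons_missing_ids += 1
    return feature_counts, exons_missing_ids

# Made-up example lines, NOT taken from the files in this thread.
gtf_text = (
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n'
    'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr1\tsrc\texon\t60\t100\t.\t+\t.\tgene_id "g1";\n'
)
counts, missing = summarize_gtf(io.StringIO(gtf_text))
print("Feature types:", dict(counts))
print("Exon lines missing gene_id or transcript_id:", missing)
```

If there are no exon lines at all, or the exon lines are missing those attributes, a different annotation source is probably needed.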

This recent topic has an example of getting the data from NCBI instead:

Let’s start there! Please try to get the fasta format simplified then we can help with the GTF data as needed. Thanks! :slight_smile:

Hi jennaj,
Thank you so much for the incredible help!
Following the Lettuce History, I was able to download the proper files for the reference genome and the GTF, and successfully run RNA STAR with these files. I think the issue was the original file formatting.
Also, when I proceeded to featureCounts using the CORRECT files, I changed the GFF filter from exon to gene, and it increased my % assignment from ~50% to ~60%, when previously with the incorrect files it had been ~13%.
Thank you for the quick response and great advice, I truly appreciate it. Galaxy is a really nice platform and the user support is quick and VERY helpful.


Great, I’m so glad you were able to get this going and importantly get the results you needed! And thank you for the kind comments! :star_struck: