RNA-STAR and hg38 GTF reference annotation

Hi, I am getting this error when trying to run RNA-STAR, how can I resolve it? Thank you.

Fatal error: Matched on FATAL ERROR

Transcriptome.cpp:48:Transcriptome: exiting because of INPUT FILE error: could not open input file /cvmfs/data.galaxyproject.org/managed/rnastar_index2/hg38/dataset_950901_files/exonGeTrInfo.tab
Solution: check that the file exists and you have read permission for this file
SOLUTION: utilize --sjdbGTFfile /path/to/annotantions.gtf option at the genome generation step or mapping step

Mar 06 00:51:59 … FATAL ERROR, exiting

Hello,

You need to supply a reference annotation GTF dataset from the history at runtime.

The GTF should be based on the UCSC “hg38” genome build. Some choices:

  • For Gencode, copy the link to the GTF and paste it into the Upload tool. Hg38 data is here: https://www.gencodegenes.org/. After it is loaded, remove the header lines (those that start with a “#”) with the Select tool, using the option “NOT Matching” with the regular expression ^#. Once the formatting is fixed, change the datatype to gtf under Edit Attributes (pencil icon). The data will be given the datatype gff by default, which works with some tools but not with others. Avoid the gff3 version of this particular data (it contains duplicated IDs, and several RNA-seq tools do not work with annotation in that format anyway).
  • For iGenomes, the archive corresponding to the target genome/build needs to be locally downloaded, the tar archive unpacked, and then just the genes.gtf data uploaded to Galaxy (browse the local file, or use FTP). Find all available genome/builds here: https://support.illumina.com/sequencing/sequencing_software/igenome.html
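If you would rather clean the Gencode file locally before uploading, the header removal described above can be sketched in Python. This is only an illustration of the same "NOT Matching ^#" filter, not what the Select tool runs internally; the function name strip_gtf_header and the sample lines are made up for the example.

```python
import re

def strip_gtf_header(lines):
    """Yield only the lines that do NOT match the regular expression ^#."""
    header = re.compile(r"^#")
    for line in lines:
        if not header.match(line):
            yield line

# Hypothetical sample content: two header lines and one annotation line.
sample = [
    "##description: example header\n",
    "##provider: EXAMPLE\n",
    'chr1\tEXAMPLE\tgene\t11869\t14409\t.\t+\t.\tgene_id "g1";\n',
]

kept = list(strip_gtf_header(sample))
print(len(kept))  # number of lines that survive the filter
```

For a real file, pass an open file handle instead of the sample list and write the kept lines to a new file.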

I did, but it was in gtf.gz format. I will reupload in gtf format in case this was why the annotation file could not be accessed.

Thanks.

That should work. The datatype gtf.gz is not supported.

I am wondering how this was loaded: gtf data in compressed format will uncompress upon Upload when “auto-detect” is used (for “type”), and gtf.gz cannot be assigned directly.
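If you prefer to uncompress the file yourself before uploading, a minimal sketch with the Python standard library could look like this. The function name gunzip_file and the file names are assumptions for the example, and the demo writes a tiny made-up GTF into a temporary directory so it can run on its own.

```python
import gzip
import os
import shutil
import tempfile

def gunzip_file(src_path, dst_path):
    """Write the uncompressed content of a .gz file to dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Demo with a hypothetical one-line annotation file.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "annotation.gtf.gz")
gtf_path = os.path.join(workdir, "annotation.gtf")

with gzip.open(gz_path, "wt") as handle:
    handle.write('chr1\tEXAMPLE\tgene\t1\t100\t.\t+\t.\tgene_id "g1";\n')

gunzip_file(gz_path, gtf_path)

with open(gtf_path) as handle:
    restored = handle.read()
```

The resulting plain .gtf file can then be uploaded as usual.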

I’d be interested in taking a look at that dataset (even if deleted). What is the history name/dataset number? You don’t need to share the actual history link/content here.

Linking in FAQ: Extended Help for Differential Expression Analysis Tools

I keep getting the following error while running RNA STAR. I cut the first five rows of the GTF file, but I got the same error message as last time.

Fatal error: Matched on FATAL ERROR

!!! WARNING: --genomeSAindexNbases 14 is too large for the genome size=375049285, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 13

EXITING because of FATAL ERROR in reads input: short read sequence line: 0
Read Name=@SRR7079259.22821015
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=650

Oct 13 12:15:45 … FATAL ERROR, exiting

gzip: stdout: Broken pipe

Hi @aaak

This may be the root problem (not all jobs that were executing when the downtime started ended gracefully). Please see: UseGalaxy.org scheduled maintenance downtime October 13, 2020 – status and updates

Once the server is back up, try a rerun. If you are still having problems: Please send in a bug report, include a link to this topic in the comments, then post back here so we know when to look for it.

This could be an actual content or setting issue we can help with. But, error messages can be unreliable when a job crashes due to problematic inputs (much depends on whether the original author anticipated and trapped specific problematic use cases). Plus, given the current downtime, I wouldn’t trust any of that quite yet. I doubt this is a tool configuration problem but we can check that with your rerun sent in as a bug report. (Details: https://www.google.com/search?q=star%20"is%20too%20large%20for%20the%20genome%20size").
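If the rerun fails the same way and you want to rule out a content problem, one rough check is to look for reads with an empty sequence line, which is what the "short read sequence line: 0" message appears to describe. The sketch below assumes a plain four-line-per-record fastq; the function name find_empty_reads and the sample reads are made up for the example.

```python
import io

def find_empty_reads(handle):
    """Return the names of 4-line fastq records whose sequence line is empty."""
    empty = []
    while True:
        name = handle.readline()
        if not name:
            break  # end of file
        seq = handle.readline()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        if seq.strip() == "":
            empty.append(name.strip())
    return empty

# Hypothetical sample: read2 has an empty sequence and quality line.
sample = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\n\n+\n\n")
bad = find_empty_reads(sample)
print(bad)
```

For compressed reads, open the file with gzip.open(path, "rt") and pass that handle in. A truncated or interrupted upload can also produce records like this, so an empty result does not by itself prove the inputs are fine.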

Thanks!

A post was split to a new topic: Human hg38 GTF, source Gencode