RNA STAR alignment with SARS-CoV-2 genome annotation - error message

Hello,
I am trying to align my RNA-seq data using RNA STAR and a GTF file I downloaded from NCBI. I tried a file from GENCODE as well. The links are below.

ftp://ftp.ebi.ac.uk/pub/databases/gencode/covid19_trackhub/data/

I keep getting the same error message:

Fatal INPUT FILE error, no valid exon lines in the GTF file: /data/dnb03/galaxy_db/files/2/f/f/dataset_2ff93edb-4e10-47af-8a34-4a2262379bde.dat
Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.

Feb 09 20:06:52 … FATAL ERROR, exiting

I tried both GTF files. I made sure they were unzipped before uploading. I also tried removing the first 4 lines, which are comments.
I was using the SARS-CoV-2 genome file provided by Galaxy, but it doesn't have an attached GTF file, so I was importing these. I am using the galaxy.eu server.
I was able to align the same RNA-seq files with RNA STAR to the human genome (hg19) using the GTF from (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz), which I found on this support thread (Help for Differential Expression Analysis).

When I look at the beginning of the files, the human and the COVID files look the same (after I removed the comment lines). The third line in both files is “features”.

Any help is appreciated!

Ermela

Hi Ermela and welcome here,

A simple explanation for the EBI/GENCODE link: that file does not provide SARS-CoV-2 annotations, but annotations of human genes related to COVID-19 (the disease).

For the NCBI annotations: I guess you should take that very first line of the error message literally. There are no exon lines in that file.
This is because SARS-CoV-2 doesn’t really have exons. Instead it has subgenomic RNAs and peptides cleaved from precursors, and RNA STAR is not prepared to handle these types of features.
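One way to confirm the "no exon lines" reading before re-running the aligner is to tally the feature types in column 3 of the GTF. The following is only a minimal sketch: the helper name `count_feature_types` is hypothetical, and the two annotation lines are illustrative stand-ins, not a copy of the real NCBI file.

```python
from collections import Counter

def count_feature_types(gtf_lines):
    """Count the values in GTF column 3 (feature type), skipping comments."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; ignore malformed lines.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts

# Illustrative lines only (assumed for this example, not the actual file).
example = [
    "#!genome-build example\n",
    'NC_045512.2\tRefSeq\tgene\t266\t21555\t.\t+\t.\tgene_id "ORF1ab";\n',
    'NC_045512.2\tRefSeq\tCDS\t266\t13468\t.\t+\t0\tgene_id "ORF1ab";\n',
]
counts = count_feature_types(example)
print(dict(counts))
print("exon lines:", counts["exon"])
```

For a real file you would pass an open file handle instead of the list, e.g. `count_feature_types(open("annotation.gtf"))`, where the filename is a placeholder. An exon count of zero would be consistent with the error above.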

In general, I’m also not sure that STAR is the right aligner for this type of data. It may well make too many splicing-machinery-specific assumptions to do a good job here.
There’s also the question of which kind of input data you are trying to analyze. Assigning reads to subgenomic RNA species makes most sense for long-read data.
We’ve tried this kind of thing before and actually have a public workflow on usegalaxy.eu that uses minimap2 for the job: "SARS-CoV-2: map ONT reads to transcripts" (under Published Workflows).

The preprint of the work using this workflow is https://www.biorxiv.org/content/10.1101/2020.07.18.204362v1
and the minimap2 settings should essentially be those used in https://www.cell.com/cell/fulltext/S0092-8674(20)30406-2

Thanks for the fast response. I think you're right to some degree, but I also figured out that the chromosome naming convention may have been causing the error message. I found this excerpt in the STAR manual, which may help some people.

2.2.2 Which annotations to use?
The use of the most comprehensive annotations for a given species is strongly recommended. Very
importantly, chromosome names in the annotations GTF file have to match chromosome names in the
FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL
GTF files, and UCSC FASTA files with UCSC GTF files. However, since UCSC uses chr1, chr2,
… naming convention, and ENSEMBL uses 1, 2, … naming, the ENSEMBL and UCSC FASTA
and GTF files cannot be mixed together, unless chromosomes are renamed to match between the
FASTA and GTF files.
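The naming requirement in the excerpt above can be checked by comparing the sequence names in the FASTA headers with the names in column 1 of the GTF. This is a rough sketch with hypothetical helper names (`fasta_names`, `gtf_names`) and made-up example lines; it only illustrates the comparison and is not part of STAR or Galaxy.

```python
def fasta_names(fasta_lines):
    """Sequence names from FASTA headers: text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }

def gtf_names(gtf_lines):
    """Sequence names from GTF column 1, skipping comments and blank lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names

# Made-up inputs (assumed for illustration): the two files use different names.
fasta = [">NC_045512.2 example genome description\n", "ATTAAAGG\n"]
gtf = ['MN908947.3\tGenbank\texon\t1\t100\t.\t+\t.\tgene_id "g1";\n']

only_in_gtf = gtf_names(gtf) - fasta_names(fasta)
only_in_fasta = fasta_names(fasta) - gtf_names(gtf)
print("only in GTF:", only_in_gtf)
print("only in FASTA:", only_in_fasta)
```

If either set is non-empty for the real files, the names do not match and one of the files would need its sequence names changed before the annotation can be used.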

I will definitely try minimap2 first; I think this may do the trick. For my experiment I infected 293T cells with SARS-CoV-2 and sent them for RNA-seq. When I got the data back, a lot of the reads from the infected cells did not map to the human genome (about 70% unmapped reads vs. 5% in the uninfected control). I wanted to map these reads to the SARS-CoV-2 genome, both as a sanity check and maybe to show how heavily infected these cells are.