STAR GTF file error for newbie

Hello! I am an undergrad and am in desperate need of any assistance, as my mentor is not familiar with bioinformatics!

I am using Galaxy to do a differential gene expression analysis of RNA-seq data and am following Galaxy’s tutorial (Reference-based RNA-Seq data analysis). I have 12 pairs of paired-end FASTQ files from HeLa cells. I ran QC but skipped Cutadapt, as one online source stated that trimming was not necessary if I would be using STAR to map my reads. (Do you think that this was a smart decision?)

Right now, I am trying to use STAR to map my reads with the Homo_sapiens.GRCh38.109.gtf annotation file, but I keep getting the following error: “Fatal INPUT FILE error, no valid exon lines in the GTF file: /jetstream2/scratch/main/jobs/49599174/inputs/dataset_b28934b5-2b19-” I used a file from Ensembl and retried with one from UCSC, but both gave the same error. What should I do for my next step?

Thank you so much in advance for your help! I really appreciate it!! :slight_smile:

Hi @EriKoy
Popular aligners use soft clipping: they ignore unmappable nucleotides at the start and end of a read. You’ll see it in the CIGAR string, e.g. 12S50M means 12 soft-clipped nucleotides at the start of the alignment followed by 50 aligned positions (matches or mismatches). For more detail, check the SAM format specification. You can check the effect of adapters by comparing counts from the original and the trimmed reads. What is important: treat all samples in the same way.
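To make the CIGAR example concrete, here is a minimal Python sketch that counts soft-clipped bases at each end of an alignment. The function names `parse_cigar` and `soft_clipped` are made up for illustration and are not part of any aligner or library; the operation letters are the ones defined in the SAM specification.

```python
import re

# One CIGAR element: a length followed by an operation letter
# (operation letters as defined in the SAM format specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Sanity check: the parsed pieces must rebuild the original string,
    # otherwise the input contained something we did not understand.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def soft_clipped(cigar):
    """Return (bases clipped at start, bases clipped at end)."""
    # Hard clips (H) can sit outside the soft clips, so drop them first.
    core = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    start = core[0][0] if core and core[0][1] == "S" else 0
    end = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return start, end


print(soft_clipped("12S50M"))   # the example from the post
print(soft_clipped("5S40M3S"))  # clipping at both ends
```

Running something like this over the CIGAR column of a SAM file from untrimmed reads gives a rough idea of how much the aligner is clipping on its own.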

We do see an elevated rate of errors when RNA STAR is used with gene annotations. Gapped aligners can map reads across splice sites without gene annotations.

Personally, I prefer a two-step approach: mapping first, then read counting. Some tools have built-in gene models for popular organisms; see the tutorial 1: RNA-Seq reads to counts
Usually I import the annotation myself, so I know exactly what is used.
RNA STAR provides very limited control over read counting, while featureCounts and htseq-count allow selection of attributes and features.
It is hard to say why the job failed without checking the annotation file. The annotation and the reference genome should be for the same version of the genome assembly and use identical chromosome/contig names. For example, chr1, Chr1 and 1 are treated as three different names.
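As a rough way to check both points on your own files, here is a small Python sketch that counts exon records in a GTF and compares its sequence names with the FASTA headers of the genome. The helper names `gtf_summary` and `fasta_names` are hypothetical, and the inline records are made-up toy data; this only approximates the kind of check behind the error message, it is not STAR's own code.

```python
def gtf_summary(lines):
    """Return (number of exon records, set of sequence names) for GTF lines."""
    exons = 0
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # a GTF record has 9 tab-separated columns
        names.add(fields[0])        # column 1: chromosome/contig name
        if fields[2] == "exon":     # column 3: feature type
            exons += 1
    return exons, names


def fasta_names(lines):
    """Return the set of sequence names found in FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


# Toy data (made up): Ensembl-style name "1" in the GTF,
# UCSC-style name "chr1" in the genome FASTA.
gtf = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n',
    '1\thavana\texon\t11869\t12227\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
fasta = [">chr1 toy sequence\n", "ACGTACGT\n"]

exon_count, gtf_names = gtf_summary(gtf)
shared = gtf_names & fasta_names(fasta)
print("exon records:", exon_count)
print("names shared with genome:", sorted(shared))
```

For real files, pass open file handles instead of the lists, e.g. `gtf_summary(open("annotation.gtf"))`. If the exon count is zero or no names are shared, the annotation and the genome do not fit together as they are.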
Maybe try the HISAT2/featureCounts approach described in the tutorial above on one sample to see if it works for you.
Hope that helps.
Kind regards,
Igor

Thank you so much for your help @igor! I was able to successfully map my reads using HISAT2/featureCounts!!

I have a quick question about my MDS plot from limma. I have two replicates for each condition, but the plot shows the samples separated from each other. Do you think that this is something that I should be concerned about?

Hi @EriKoy
MDS plots are discussed in depth in the tutorial 2: RNA-seq counts to genes

Hope that helps.

Kind regards,

Igor

Dear @igor,

Thank you so much for your advice! I was able to make a volcano plot and successfully finished my senior thesis because of your help!

Thanks again, and please take care,

Erica