RSeQC: Read Distribution is mostly UTR instead of exons for mRNA-seq

Hello,

I am analyzing Illumina stranded mRNA library-prep data. I aligned the reads to the zebrafish genome danRer11/GRCz11 from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/danRer11/bigZips/danRer11.fa.gz), after removing the alt chromosomes so that only the primary assembly remained.
For the genome annotation, I used a GTF file from a lab: https://www.umassmed.edu/globalassets/lawson-lab/downloadfiles/v4.3.2.gtf. This GTF was made specifically for the danRer11 assembly and is supposed to bring together the Ensembl and RefSeq annotations.

I built the STAR indices with both the genome file and the annotation file (for splice junctions). After alignment with STAR, I checked the quality of the alignments. Over 80% of the reads aligned.

The interesting part came when I looked at the Read Distribution of the alignments: instead of the majority of the reads mapping to CDS exons, the expected portion (~80%) was split between 5’ UTR and 3’ UTR (~40% each). This is somewhat bizarre, and I am wondering whether it is due to the GTF file not being parsed correctly.

When I looked at the GTF file I used, the third column does use the term “exon”, which is what I would expect.
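A quick way to see everything the third column contains, not just “exon”, is to tally the feature types. The snippet below is only a rough sketch: the example lines are made up for illustration and are not taken from v4.3.2.gtf, and the helper names are hypothetical. The counts show at a glance whether the file also carries features such as CDS, start_codon, or stop_codon.

```python
from collections import Counter


def make_line(feature, start, end, attributes):
    """Build one tab-separated GTF line (illustrative values only)."""
    return "\t".join(
        ["chr1", "example", feature, str(start), str(end), ".", "+", ".", attributes]
    )


def count_feature_types(gtf_lines):
    """Tally column 3 (the feature type) over an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid GTF record has 9 tab-separated columns.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up records: one transcript with two exons and no CDS lines.
attrs = 'gene_id "g1"; transcript_id "tx1";'
example = [
    "# made-up header line",
    make_line("transcript", 100, 900, attrs),
    make_line("exon", 100, 300, attrs),
    make_line("exon", 600, 900, attrs),
]

feature_counts = count_feature_types(example)
for feature, n in sorted(feature_counts.items()):
    print(feature, n)

# To run on a real file (the local path is an assumption):
# with open("v4.3.2.gtf") as handle:
#     print(count_feature_types(handle))
```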

Does anyone have an idea of what’s happening here? Any insight is much appreciated.

I am also having trouble later in the workflow: running ‘Annotate DESeq2/DEXSeq output tables’ with the same GTF file as input produces an empty output. I’m wondering if the two issues are related.

-Christine

Hi @cscho

Thanks for including that extra info. The GTF format is certainly a good place to start checking.

General format tips are here: working-with-gff-gft-gtf2-gff3-reference-annotation

What pops out to me is the sort order of the GTF file in the last screenshot. The “transcript” features should be ordered before the “exon” features associated with them, not after.
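To check the ordering across the whole file rather than by eye, something like the following could be used. It is a minimal sketch, assuming each record carries a transcript_id attribute in the usual GTF style; the function name and the example records are hypothetical. It reports every transcript whose first “exon” line appears before its “transcript” line (a transcript with no “transcript” line at all would be reported too).

```python
import re

# Assumes attributes are written as: transcript_id "some_id";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def make_line(feature, start, end, transcript_id):
    """Build one tab-separated GTF line (illustrative values only)."""
    attributes = 'gene_id "g1"; transcript_id "%s";' % transcript_id
    return "\t".join(
        ["chr1", "example", feature, str(start), str(end), ".", "+", ".", attributes]
    )


def exons_before_transcript(gtf_lines):
    """Return transcript IDs whose exon lines precede their transcript line."""
    seen_transcripts = set()
    flagged = set()
    out_of_order = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # e.g. gene-level records without a transcript_id
        tid = match.group(1)
        if fields[2] == "transcript":
            seen_transcripts.add(tid)
        elif fields[2] == "exon" and tid not in seen_transcripts and tid not in flagged:
            flagged.add(tid)
            out_of_order.append(tid)
    return out_of_order


# Made-up records: tx1 is ordered as expected, tx2 lists its exon first.
example = [
    make_line("transcript", 100, 900, "tx1"),
    make_line("exon", 100, 300, "tx1"),
    make_line("exon", 2000, 2400, "tx2"),
    make_line("transcript", 2000, 2400, "tx2"),
]

problem_ids = exons_before_transcript(example)
print(problem_ids)
```

An empty list means every exon was preceded by its transcript line; any IDs returned are the places to look at first.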

For comparison, you could try one of the GTFs that UCSC provides. All three will work with the tools you mentioned. These are ready-to-use: copy/paste a file URL into the Upload tool using all defaults. Galaxy will uncompress, assign the correct datatype, and tools will recognize the dataset as valid GTF/GFF input. The files are listed here: Index of /goldenPath/danRer11/bigZips/genes