I am analyzing Illumina stranded mRNA library-prep data. I aligned the reads to the zebrafish genome danRer11/GRCz11 from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/danRer11/bigZips/danRer11.fa.gz), after cutting out the alt chromosomes to keep just the primary assembly.
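For anyone wanting to reproduce the filtering step, here is a minimal sketch of one way to drop alt sequences from a FASTA file. It is not necessarily the exact procedure used above. It assumes the alt sequence names end in `_alt` (the usual UCSC naming pattern), and the function name `drop_alt_sequences` is made up for illustration.

```python
def drop_alt_sequences(lines, suffix="_alt"):
    """Yield FASTA lines, skipping every record whose name ends with `suffix`.

    Assumption: alt sequences are identified only by their name suffix.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = not name.endswith(suffix)
        if keep:
            yield line


# Tiny made-up example; a real run would iterate over the opened FASTA file instead.
demo = [">chr1\n", "ACGT\n", ">chr1_KZ115969v1_alt\n", "TTTT\n", ">chrM\n", "GGCC\n"]
print("".join(drop_alt_sequences(demo)), end="")
```

Because the function works on any iterable of lines, it can be fed an open file handle and the result written straight to a new file without loading the whole genome into memory.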
For the genome annotation file, I used a GTF file from a lab: https://www.umassmed.edu/globalassets/lawson-lab/downloadfiles/v4.3.2.gtf. This GTF was made specifically for the danRer11 assembly and is supposed to bring together the Ensembl and RefSeq annotations.
I built the STAR indices from both the genome file and the annotation file (for splice junctions), aligned with STAR, and then checked the quality of the alignments. Over 80% of the reads aligned.
The interesting part is the Read Distribution of the alignments: instead of the majority of the reads mapping to CDS exons, the portion I expected there (~80%) was split between 5’ UTR and 3’ UTR (~40% each). This is somewhat bizarre, and I am wondering whether the GTF file is not being parsed correctly.
When I looked at the GTF file I used, the third column does use the term “exon”, which is what I would expect.
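A quick way to go beyond eyeballing the file is to tally every feature type in the third column. The sketch below does this. The function name `count_feature_types` and the demo lines are made up for illustration, and the check only shows which feature types are present (for example whether there are any “CDS” rows alongside “exon”), which may matter for tools that separate CDS from UTR. It does not prove how any particular tool parses the file.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally the values found in the third (feature) column of GTF lines."""
    counts = Counter()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A well-formed GTF record has 9 tab-separated columns.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up demo records; for the real file, pass an open handle instead, e.g.
#   with open("v4.3.2.gtf") as handle: print(count_feature_types(handle))
demo = [
    "# demo header\n",
    'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    'chr1\tdemo\texon\t200\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
]
print(count_feature_types(demo))
```

If the tally for the real GTF shows only “exon” rows and no “CDS”, start/stop codon, or UTR rows, that would at least be consistent with a downstream tool being unable to tell coding from non-coding exon sequence.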
Does anyone have an idea of what’s happening here? Any insight is much appreciated.
I am also having trouble later in the workflow: ‘Annotate DESeq2/DEXSeq output tables’, run with the same GTF file as input, produces an empty output. I’m wondering if the issues are related.