STAR/HISAT2 aligning reads from RNA-seq fastq to intronic/unannotated regions

Hello all - I just got into bioinformatics and stumbled into a problem that I’ve been trying to resolve for days. I am attempting to re-analyze an RNA-seq dataset. The data is available as fastq at ebi.ac.uk, with the accession number: SRR5217848 (where I’ve downloaded it from: ENA Browser (ebi.ac.uk)). I uploaded it to Galaxy, ran FastQC, everything looked okay, ran Trimmomatic, took the paired data (as opposed to the unpaired which I understand I do not need per se for downstream analysis) and reached the point where I should align the reads to the genome. The genome I have is in .fasta format, I it downloaded from ucsc (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/GCA_002082055.1.fa.gz) and also a .gtf file for annotation as to where the exons/CDSs/mRNA are, also taken from ucsc (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/genes/GCA_002082055.1_nHd_3.1.xenoRefGene.gtf.gz). I decided to use RNA STAR from Galaxy for the alignment → For “Custom or in-built referece genome”, I chose “Use reference genome from history and build temporary index” and uploaded the .fasta genome into " Select a reference genome"; for “Build index with or without known splice junctions annotation”, I chose “build index with gene-model” and uploaded the annotated .gtf file to " Gene model (gff3,gtf) file for splice junctions". In the end when I get the result, I downloaded the RNA STAR .bam file and opened with a viewer (I’m using SeqMonk from the Babraham Institute), STAR seems to have aligned reads to the whole genome and not exclusively to the exons/CDS/mRNA which are described in the .gtf file (I am uploading a screenshot from the SeqMonk viewer). Could there be a reason for that and how can I address this problem appropriately?

Thanks a lot :)! You’re literally saving my life as this research is very important to me.

Hi @Aleksandar_Markovski,
I don’t think there’s a problem in the alignment; STAR extracts the splice junction information from the annotation in order to improve the accuracy of the mapping, not for restricting the mapping to those regions.

Regards

@gallardoalba Wow, that’s very useful to know. Can I then pose two questions just to see if I understand correctly: 1) Is it then the case that simply this RNA-seq dataset seems to have a lot of reads mapping supposedly outside of the putative genes? 2) In general, is it necessary for these alignment softwares such as STAR to use a .gtf file with annotations, or is it the case that a .fasta file with the genome unannotated is sufficient (thus STAR predicts where CDSs/exons are on its own)?

Hi @Aleksandar_Markovski, regarding the first question: yes, it is possible; some of the factors that can explain it are a incomplete gene annotation and the presence of a large number of pseudogenes. Regarding the second question, yes, it is recommended, specially for Eukaryotes, due to post-transcriptional modification.

Regards