STAR/HISAT2 aligning reads from RNA-seq fastq to intronic/unannotated regions

Aleksandar_Markovski · August 29, 2022, 7:16am

Hello all - I just got into bioinformatics and stumbled into a problem that I’ve been trying to resolve for days. I am attempting to re-analyze an RNA-seq dataset. The data is available as fastq at ebi.ac.uk, with the accession number: SRR5217848 (where I’ve downloaded it from: ENA Browser (ebi.ac.uk)). I uploaded it to Galaxy, ran FastQC, everything looked okay, ran Trimmomatic, took the paired data (as opposed to the unpaired which I understand I do not need per se for downstream analysis) and reached the point where I should align the reads to the genome. The genome I have is in .fasta format, I it downloaded from ucsc (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/GCA_002082055.1.fa.gz) and also a .gtf file for annotation as to where the exons/CDSs/mRNA are, also taken from ucsc (https://hgdownload.soe.ucsc.edu/hubs/GCA/002/082/055/GCA_002082055.1/genes/GCA_002082055.1_nHd_3.1.xenoRefGene.gtf.gz). I decided to use RNA STAR from Galaxy for the alignment → For “Custom or in-built referece genome”, I chose “Use reference genome from history and build temporary index” and uploaded the .fasta genome into " Select a reference genome"; for “Build index with or without known splice junctions annotation”, I chose “build index with gene-model” and uploaded the annotated .gtf file to " Gene model (gff3,gtf) file for splice junctions". In the end when I get the result, I downloaded the RNA STAR .bam file and opened with a viewer (I’m using SeqMonk from the Babraham Institute), STAR seems to have aligned reads to the whole genome and not exclusively to the exons/CDS/mRNA which are described in the .gtf file (I am uploading a screenshot from the SeqMonk viewer). Could there be a reason for that and how can I address this problem appropriately?

Thanks a lot :)! You’re literally saving my life as this research is very important to me.

gallardoalba · August 29, 2022, 11:43am

Hi @Aleksandar_Markovski,
I don’t think there’s a problem in the alignment; STAR extracts the splice junction information from the annotation in order to improve the accuracy of the mapping, not for restricting the mapping to those regions.

Regards

Aleksandar_Markovski · August 30, 2022, 3:13am

@gallardoalba Wow, that’s very useful to know. Can I then pose two questions just to see if I understand correctly: 1) Is it then the case that simply this RNA-seq dataset seems to have a lot of reads mapping supposedly outside of the putative genes? 2) In general, is it necessary for these alignment softwares such as STAR to use a .gtf file with annotations, or is it the case that a .fasta file with the genome unannotated is sufficient (thus STAR predicts where CDSs/exons are on its own)?

gallardoalba · August 30, 2022, 7:52am

Hi @Aleksandar_Markovski, regarding the first question: yes, it is possible; some of the factors that can explain it are a incomplete gene annotation and the presence of a large number of pseudogenes. Regarding the second question, yes, it is recommended, specially for Eukaryotes, due to post-transcriptional modification.

Regards

Topic		Replies	Views
RNA Star: Can I generate a temporary index with files from previous assemblies? reference-annotation , reference-genome	2	128	May 13, 2024
Genome index STAR transcriptomics , igv	1	45	August 13, 2024
RNA star with reference genome from history error usegalaxy.eu support custom-genome , transcriptomics , reference-annotation , reference-genome , custom-build , deseq2 , rna_star	1	312	August 7, 2023
Aligning reads to custom plasmid sequence usegalaxy.eu support	2	818	March 14, 2021
How to map a collection of individual samples against a custom reference genome with RNA STAR usegalaxy.eu support troubleshooting , igv	4	29	April 6, 2025

STAR/HISAT2 aligning reads from RNA-seq fastq to intronic/unannotated regions

Related topics