Asking for unmapped genes in RNA sequencing via Galaxy

Hi. I have a paired-end RNA sequencing dataset that I analyzed with Galaxy. The data quality was good, so I skipped the trimming step. However, after mapping with the STAR tool, the uniquely mapped genes were only around 6%. The "mapped to loci" value was 3.2%, and unmapped due to short reads was 90%. What should I do?

Welcome @Atefeh_Bahmei

Let’s work through the problem and try to come up with some things to try. :slight_smile:

The mapping job has four key components. Double checking each is how to troubleshoot unexpected results. Would you like to review these first, then we can follow up more?

  1. Reads

    Running quality assurance tools is how to prepare this data. It sounds like you have already trimmed off any artifacts and resolved other potential issues. Good! XRef → Quality Control Start Here! multQC issue and guidance?

    However, RNA STAR later reported that the reads were “too short” to map. Was the QA too aggressive? Is there anything special about these reads? Are they usable?

  2. Reference genome (fasta)

    RNA-seq reads should be mapped against a reference genome from the same species, so make sure that your selection was correct. You might also want to review the format of the fasta dataset if it was uploaded and supplied from the history. XRef → FAQ: How to use Custom Reference Genomes?

    The quality of this assembly can also matter. If this was a model organism, the assembly is likely approaching a finished state. But for others, the status may still be in a draft state, and may be difficult to map against.

  3. Reference annotation (gff3, gtf)

    This should be based on the same assembly version as the reference genome. Why? Because the coordinates for the annotation represent positions along the bases in the assembly fasta file. Then the reads are mapped to the same fasta file. Finally, the coordinates for the annotation and where the reads mapped are compared to discover overlaps. These are the “mapped genes” statistics. If the reads and annotation do not have overlapping coordinates (along with a few other conditions mediated by the other parameters), then there will not be an assigned result.

    If the coordinate systems are mismatched, a result like yours can come up!

  4. Mapping tool choice and parameters

    RNA STAR is included in several Galaxy tutorials. Please find these linked at the bottom of the tool form, also here → :graduation_cap: RNA STAR: Gapped-read mapper for RNA-seq data Galaxy Training!

    RNA STAR has many parameters that you can tune. An important one is on the top level of the form: Length of the genomic sequence around annotated junctions. Read the help on the form. The default assumes a minimum sequence read length of 101 bases. Does this fit your data or do you need to adjust it?

    As a cross check, you could also try a tool like HISAT2 to see what happens! Be sure to see the tutorials for this one too since the defaults are unlikely to produce what you will want for a typical RNA-seq experiment. → :graduation_cap: HISAT2: A fast and sensitive alignment program Galaxy Training!
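
If it helps with the read length question in item 4, here is a minimal Python sketch (my own illustration, not part of Galaxy or RNA STAR) that tallies the read lengths in a FASTQ file and derives a candidate value for the junction length setting. It assumes standard four-line FASTQ records, and it uses the rule of thumb from the STAR manual that the ideal value is the maximum read length minus 1. The function names and the file name in the usage comment are hypothetical.

```python
import gzip
from collections import Counter


def read_length_counts(fastq_path, max_reads=100_000):
    """Tally read lengths in the first max_reads records of a FASTQ file.

    Assumes four lines per record (sequences are not wrapped).
    """
    opener = gzip.open if fastq_path.endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 1:  # the sequence is the second line of each record
                counts[len(line.strip())] += 1
    return counts


def suggested_overhang(counts):
    """Return the longest observed read length minus 1."""
    return max(counts) - 1


# Hypothetical usage, run once per mate file:
# counts = read_length_counts("sample_R1.fastq.gz")
# print(sorted(counts.items()))
# print(suggested_overhang(counts))
```

If most reads turn out to be much shorter than you expected, that points back to item 1 rather than to the mapping parameters.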

Then, for complete protocols, the first few tutorials here are really good and each has a workflow template that can be adapted.



So, please review your data for these, and maybe try with HISAT2 to see if you get the same result? And you are welcome to share back your history for a closer review. → How to get faster help with your question
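
As a quick first check for item 3, here is a minimal Python sketch (again my own illustration, with hypothetical function and file names) that compares the sequence names in the genome fasta with the names in the first column of the annotation. It assumes uncompressed, tab-delimited gtf or gff3 input. Matching names do not prove that both files come from the same assembly version, but mismatched names (for example `chr1` versus `1`) are one way the coordinate systems can fail to line up.

```python
def fasta_names(path):
    """Collect sequence identifiers from the '>' lines of a fasta file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The identifier is the text between '>' and the first whitespace
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def annotation_names(path):
    """Collect the values of the first column of a gtf/gff3 file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # a gff3 file may end with embedded sequences
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives, and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_path, annotation_path):
    """Report which sequence names are shared and which appear in only one file."""
    genome = fasta_names(fasta_path)
    annot = annotation_names(annotation_path)
    return {
        "shared": genome & annot,
        "annotation_only": annot - genome,
        "genome_only": genome - annot,
    }


# Hypothetical usage:
# print(compare_names("genome.fa", "genes.gtf"))
```

An empty `shared` set, or a large `annotation_only` set, would be worth fixing before rerunning the mapping.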

Let’s start there! :scientist: