I’m performing a meta-analysis across multiple human RNA-seq datasets. My STAR alignment rate is >70%, but featureCounts assigns <50% of reads, with many ‘no feature’ reads. I am confident about the strandedness setting, and I used the same GTF file for both STAR and featureCounts. These datasets all targeted mRNA. What could be the reason for this discrepancy? Can I proceed with these data as they are, or should I make changes? What best practices would you recommend for meta-analysis in this situation?
Welcome @nmozdoori
It sounds like the reads are mapping to the reference genome, but not overlapping with the target annotation features (with the current annotation choice and parameter settings).
The featureCounts reports are usually a good starting place as a diagnostic tool for the “why”. There is a good discussion topic here about interpreting them, with links out to publications and other public discussions.
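As a rough illustration of reading that report outside of Galaxy, here is a minimal Python sketch that turns a featureCounts summary table (tab-separated, one `Status` column plus one column per BAM) into percentages per category. The function name and the example numbers are made up for illustration, not taken from your data.

```python
import csv
import io

# Hypothetical example content; real values come from your own summary file.
EXAMPLE_SUMMARY = (
    "Status\tsample1.bam\n"
    "Assigned\t450\n"
    "Unassigned_Unmapped\t0\n"
    "Unassigned_NoFeatures\t400\n"
    "Unassigned_MultiMapping\t100\n"
    "Unassigned_Ambiguity\t50\n"
)


def summarize_featurecounts(summary_text):
    """Return {sample: {status: percent}} for every non-zero status row."""
    reader = csv.reader(io.StringIO(summary_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        status = row[0]
        for sample, value in zip(samples, row[1:]):
            counts[sample][status] = int(value)

    percents = {}
    for sample, by_status in counts.items():
        total = sum(by_status.values())
        percents[sample] = {
            status: 100.0 * n / total
            for status, n in by_status.items()
            if n > 0 and total > 0
        }
    return percents


# Print categories from largest to smallest share for each sample.
for sample, by_status in summarize_featurecounts(EXAMPLE_SUMMARY).items():
    print(sample)
    for status, pct in sorted(by_status.items(), key=lambda kv: -kv[1]):
        print(f"  {status}: {pct:.1f}%")
```

A large `Unassigned_NoFeatures` share points at the annotation or feature type choice, while a large `Unassigned_MultiMapping` or `Unassigned_Ambiguity` share points at the counting parameters instead.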
More ideas
- Is RNA STAR the best mapping tool choice? HISAT2 is a common alternative.
- Is featureCounts the best counting tool choice? htseq-count is one alternative, and StringTie, Salmon, and Kallisto are others.
- The workflow can include a combination of tools, too! We have examples in our tutorials. Please don’t be put off by the “introduction” tone of these, they are actually quite sophisticated and each includes a workflow template you could adopt and customize!
- This one is most similar to what you are already doing. Note which parameters were changed away from the defaults. → Hands-on: Reference-based RNA-Seq data analysis / Reference-based RNA-Seq data analysis / Transcriptomics
- Advanced! → Hands-on: Genome-wide alternative splicing analysis / Genome-wide alternative splicing analysis / Transcriptomics
- Then, for polished production workflows, explore here. → https://iwc.galaxyproject.org/
- This one offers alternative statistical and visualization methods. → IWC RNA-Seq Analysis: Paired-End Read Processing and Quantification
I would suggest choosing a few samples to explore more closely. Process the data through mapping, then branch off into the different processing (and annotation!) choices and compare the results to decide how to process the full batch.
You didn’t mention which annotation you are using, but GENCODE is usually a good baseline to compare against if you are currently using something different. You could get both the GRCh38/hg38 genome and the annotation from the same source, then reduce both down to the primary autosomes chr1-22 plus chrM and chrX. Leaving out chrY (PAR complications) and all of the haplotype and other fragments can reduce multi-mapping issues, at least for exploratory purposes.
Example history with those versions prepared for use with tools (see the hidden datasets for the methods) → https://usegalaxy.org/u/jen-galaxyproject/h/gencode-grch38-hg38-human-female
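If you want to see what that reduction looks like outside of Galaxy, here is a minimal Python sketch, assuming UCSC/GENCODE-style sequence names (`chr1`, `chrX`, `chrM`). The function names are hypothetical, and this is not necessarily how the example history above was built (see its hidden datasets for that).

```python
# Chromosomes to keep: autosomes chr1-22 plus chrX and chrM (no chrY, no fragments).
KEEP = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrM"}


def filter_gtf_lines(lines, keep=KEEP):
    """Yield GTF comment lines and any feature line whose first column is in keep."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines as-is
            continue
        seqname = line.split("\t", 1)[0]
        if seqname in keep:
            yield line


def filter_fasta_lines(lines, keep=KEEP):
    """Yield FASTA records whose name (first word after '>') is in keep."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keeping = name in keep
        if keeping:
            yield line


def filter_file(in_path, out_path, line_filter):
    """Stream an uncompressed text file through one of the filters above."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(line_filter(src))
```

Calling `filter_file` once with `filter_gtf_lines` on the annotation and once with `filter_fasta_lines` on the genome keeps the two in sync, which matters because the index and the counting step must agree on sequence names.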
Hope this helps! You should be able to do all of this in Galaxy, and most with a workflow. This was a good question!