I have paired-end FASTQ files from an Illumina NovaSeq run, generated by whole-transcriptome mRNA-seq profiling. My RNA STAR result looks OK (using an hg38 GTF file from the UCSC Table Browser).
Number of input reads | 38164847
Average input read length | 201
UNIQUE READS:
Uniquely mapped reads number | 30007228
Uniquely mapped reads % | 78.63%
Average mapped length | 200.97
Number of splices: Total | 18124401
Number of splices: Annotated (sjdb) | 17921546
Number of splices: GT/AG | 17970447
Number of splices: GC/AG | 117674
Number of splices: AT/AC | 16850
Number of splices: Non-canonical | 19430
Mismatch rate per base, % | 0.19%
Deletion rate per base | 0.01%
Deletion average length | 1.73
Insertion rate per base | 0.01%
Insertion average length | 1.47
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7150744
% of reads mapped to multiple loci | 18.74%
Number of reads mapped to too many loci | 62501
% of reads mapped to too many loci | 0.16%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 2.44%
% of reads unmapped: other | 0.03%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
However, when I use featureCounts I get very few assigned reads; most of the reads are in the Unassigned_MultiMapping category. (I use the reverse-stranded option, as indicated by "infer experiment".)
Assigned | 7712538
Unassigned_Unmapped | 0
Unassigned_MappingQuality | 0
Unassigned_Chimera | 0
Unassigned_FragmentLength | 0
Unassigned_Duplicate | 0
Unassigned_MultiMapping | 37553849
Unassigned_Secondary | 0
Unassigned_NonSplit | 0
Unassigned_NoFeatures | 4138912
Unassigned_Overlapping_Length | 0
Unassigned_Ambiguity | 18155778
What could be the reason behind it? Is there a way to improve on that? I am losing a lot of reads due to multimapping.
I also checked the read distribution. It looks OK to me too.
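A quick sanity check on the two summaries above can help separate the multimapping question from the ambiguity question. The sketch below uses only the numbers posted in this thread. It assumes that, with default settings, every alignment of a multi-mapped read lands in Unassigned_MultiMapping, so the other three non-zero categories come from uniquely mapped reads; that is an interpretation, not something stated in the tool output.

```python
# Counts copied from the STAR log and the featureCounts summary posted above.
star = {
    "uniquely_mapped": 30007228,
    "multi_mapped": 7150744,
}
fc = {
    "Assigned": 7712538,
    "Unassigned_MultiMapping": 37553849,
    "Unassigned_NoFeatures": 4138912,
    "Unassigned_Ambiguity": 18155778,
}

# Assumption: these three categories hold only uniquely mapped reads.
unique_keys = ("Assigned", "Unassigned_NoFeatures", "Unassigned_Ambiguity")
unique_total = sum(fc[key] for key in unique_keys)

print("unique total matches STAR:", unique_total == star["uniquely_mapped"])
for key in unique_keys:
    print(f"{key}: {fc[key] / unique_total:.1%} of uniquely mapped reads")

# Average number of reported alignments per multi-mapped read.
alignments_per_multi_read = fc["Unassigned_MultiMapping"] / star["multi_mapped"]
print(f"alignments per multi-mapped read: {alignments_per_multi_read:.2f}")
```

The three categories add up exactly to STAR's uniquely mapped read count, which supports the assumption. Under that reading, the large Unassigned_MultiMapping number reflects the 7.15 million multi-mapped reads being counted once per alignment, while about 60% of the uniquely mapped reads are lost to Unassigned_Ambiguity.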
Avoid the UCSC reference GTFs from their Table Browser. These often end up truncated, and there is also a serious data content concern. The reasons are covered in more detail in this FAQ:
Good sources for hg38 GTF reference annotation are described in this prior Q&A (and are included in the FAQ above as well):
Give one or both of those a try and see if your "Unassigned_Ambiguity" and "Unassigned_MultiMapping" counts reduce. They should ("gene_id" and "transcript_id" will no longer be the same value).
You may even get fewer "Unassigned_NoFeatures" if the UCSC data was truncated when extracted from the Table Browser.
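One way to see whether a given GTF has the "gene_id" equals "transcript_id" problem is to tally the exon records directly. This is a minimal sketch, assuming a standard tab-separated GTF with `key "value";` attributes in column 9; the function name and the two example records are made up for illustration and are not taken from any real annotation.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_vs_transcript_ids(gtf_lines):
    """Tally exon records by whether gene_id equals transcript_id."""
    tally = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only look at well-formed exon records
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            tally["missing"] += 1
        elif gene_id == transcript_id:
            tally["same"] += 1
        else:
            tally["different"] += 1
    return tally

# Two invented records: one with identical IDs, one with distinct IDs.
example = [
    'chr1\tsrcA\texon\t100\t200\t.\t+\t.\tgene_id "NM_000001"; transcript_id "NM_000001";\n',
    'chr1\tsrcB\texon\t100\t200\t.\t+\t.\tgene_id "GENE1"; transcript_id "NM_000001";\n',
]
print(gene_vs_transcript_ids(example))

# With a real file (the path is a placeholder):
#   with open("annotation.gtf") as handle:
#       print(gene_vs_transcript_ids(handle))
```

If nearly every exon record falls under "same", each transcript is being treated as its own gene, and reads that overlap several transcripts of one gene will be reported as ambiguous.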
Thank you! While waiting for your reply, I tried the iGenomes GTF file. It definitely reduces ambiguity, but the multimapping issue still remains.
FeatureCounts only reports unique matches with default settings.
Your reads are likely hitting more than one "exon", which leads to "multimapping" counts when summarized at the Gene level.
Review the "Advanced Options". In particular, pay attention to these parameters, but also review others and see what results. There isn't a single right answer for everyone. It depends on how you want these counted up, if at all.
"Allow read to contribute to multiple features" (default=no)
"Largest overlap" (default=no)
"Count multi-mapping reads/fragments" (default=disabled) and the sub-option (when enabled) "Assign fractions to multimapping reads"
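For anyone running featureCounts outside Galaxy, the options above appear to correspond to the command-line flags `-O`, `--largestOverlap`, `-M`, and `--fraction`. The mapping from the Galaxy form labels to these flags is an assumption on my part, so check it against the tool's help text. The sketch below only builds the argument list and does not run anything; the helper name and the file names are placeholders.

```python
import shlex

def featurecounts_command(bam, gtf, out, *, multi_overlap=False,
                          largest_overlap=False, count_multimapping=False,
                          fractional=False):
    """Build (not run) a featureCounts command line as a list of arguments."""
    cmd = [
        "featureCounts",
        "-a", gtf,        # annotation file
        "-o", out,        # output counts table
        "-t", "exon",     # feature type to count over
        "-g", "gene_id",  # attribute used to group features into genes
        "-s", "2",        # reversely stranded, as reported in this thread
        "-p",             # paired-end input; newer Subread releases may also
                          # need --countReadPairs to count fragments
    ]
    if multi_overlap:
        cmd.append("-O")                # read counts toward every overlapping feature
    if largest_overlap:
        cmd.append("--largestOverlap")  # assign to the feature with the largest overlap
    if count_multimapping:
        cmd.append("-M")                # also count multi-mapping reads
    if fractional:
        cmd.append("--fraction")        # fractional counts; used together with -M and/or -O
    cmd.append(bam)
    return cmd

# Placeholder file names; multi-mapping reads counted fractionally.
cmd = featurecounts_command("sample.bam", "annotation.gtf", "counts.txt",
                            count_multimapping=True, fractional=True)
print(shlex.join(cmd))
```

Keeping each option as an explicit switch makes it easy to run the same BAM with several combinations and compare the resulting summary tables before deciding which counting scheme suits the analysis.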
Hi.
I have the same problem: a lot of unassigned multi-mapped reads. When I select the "Allow reads to map to multiple features" option, my problem is fixed and more than 50% of reads are assigned. Now I want to ask: is it scientifically and technically sound to allow reads to map to multiple features? (My aim in this analysis is to find DEGs.)
Thanks for introducing these discussions. I read them.
Can we use one of the tools Salmon, RSEM, or Kallisto in Galaxy to deal with multi-mapped reads?
If the answer is yes, do any tutorials exist for that? If Galaxy has any other tool for this purpose, please point me to it.