Unassigned Multimapping in featurecounts

srashid · October 31, 2019, 6:05pm

I have paired-end FASTQ files from an Illumina NovaSeq, generated by whole-transcriptome mRNA-seq profiling. My RNA STAR result looks OK (using the hg38 GTF file from the UCSC Table Browser).

                      Number of input reads |	38164847
                  Average input read length |	201
                                UNIQUE READS:
               Uniquely mapped reads number |	30007228
                    Uniquely mapped reads % |	78.63%
                      Average mapped length |	200.97
                   Number of splices: Total |	18124401
        Number of splices: Annotated (sjdb) |	17921546
                   Number of splices: GT/AG |	17970447
                   Number of splices: GC/AG |	117674
                   Number of splices: AT/AC |	16850
           Number of splices: Non-canonical |	19430
                  Mismatch rate per base, % |	0.19%
                     Deletion rate per base |	0.01%
                    Deletion average length |	1.73
                    Insertion rate per base |	0.01%
                   Insertion average length |	1.47
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	7150744
         % of reads mapped to multiple loci |	18.74%
    Number of reads mapped to too many loci |	62501
         % of reads mapped to too many loci |	0.16%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |	0.00%
             % of reads unmapped: too short |	2.44%
                 % of reads unmapped: other |	0.03%
                              CHIMERIC READS:
                   Number of chimeric reads |	0
                        % of chimeric reads |	0.00%

However, when I use featureCounts I get very few assigned reads; most of the reads fall in the Unassigned_MultiMapping category. (I use the reverse-stranded option, as indicated by ‘Infer Experiment’.)

Assigned	7712538
Unassigned_Unmapped	0
Unassigned_MappingQuality	0
Unassigned_Chimera	0
Unassigned_FragmentLength	0
Unassigned_Duplicate	0
Unassigned_MultiMapping	37553849
Unassigned_Secondary	0
Unassigned_NonSplit	0
Unassigned_NoFeatures	4138912
Unassigned_Overlapping_Length	0
Unassigned_Ambiguity	18155778

What could be the reason behind it? Is there a way to improve on that? I am losing a lot of reads due to multimapping.

I also checked the read distribution. It looks OK to me too.
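For scale, the summary above can be expressed as percentages. This is only a minimal sketch: the counts are copied from the table above (zero-count categories omitted), and the function name is illustrative, not part of featureCounts or Galaxy.

```python
# Non-zero counts copied from the featureCounts summary above.
summary = {
    "Assigned": 7712538,
    "Unassigned_MultiMapping": 37553849,
    "Unassigned_NoFeatures": 4138912,
    "Unassigned_Ambiguity": 18155778,
}


def category_percentages(counts):
    """Return each category as a percentage of the summed counts."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


percentages = category_percentages(summary)

# Print the categories from largest to smallest share.
for name, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s}{pct:6.1f}%")
```

With these numbers, roughly 11% of the entries are assigned and a little over half are reported as Unassigned_MultiMapping.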

jennaj · October 31, 2019, 7:55pm

Welcome @srashid

Avoid the UCSC reference GTFs from their Table Browser. These often end up truncated, plus there is a serious data content concern. The reasons are covered in more detail in this FAQ:

Extended Help for Differential Expression Analysis Tools

Good sources for hg38 GTF reference annotation are described in this prior Q&A (and are included in the FAQ above as well):

RNA-STAR and hg38 GTF reference annotation

The GTF should be based on the UCSC “hg38” genome build. Some choices:

For Gencode, copy the link to the GTF and paste it into the Upload tool. Hg38 data is here: https://www.gencodegenes.org/. After it is loaded, remove the headers (lines that start with a “#”) with the Select tool using the option “NOT Matching” with the regular expression ^# . Once the formatting is fixed, change the datatype to gtf under Edit Attributes (pencil icon). The data will be given the datatype gff by default, which works fine with some tools but not with others. Avoid the gff3 version of this particular data (it contains duplicated IDs, and several RNA-seq tools do not work with annotation in that format anyway).

For iGenomes, the archive corresponding to the target genome/build needs to be downloaded locally, the tar archive unpacked, and then just the genes.gtf data uploaded to Galaxy (browse the local file, or use FTP). Find all available genomes/builds here: iGenomes
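Outside Galaxy, the header-removal step described for the Gencode GTF can be sketched in a few lines of Python. This is an illustration only: the function name and the example lines are made up, and the filter simply mirrors the “NOT Matching” ^# selection.

```python
import re


def strip_gtf_header(lines):
    """Keep only lines that do NOT match ^#, like Select with "NOT Matching"."""
    header = re.compile(r"^#")
    return [line for line in lines if not header.search(line)]


# Made-up example input: two header lines and one annotation record.
example = [
    "##description: example header line",
    "##format: gtf",
    'chr1\tEXAMPLE\tgene\t100\t200\t.\t+\t.\tgene_id "G1";',
]

kept = strip_gtf_header(example)
print(len(kept))
```

For a real file, the same function can be applied to the lines read from the downloaded GTF before writing the result back out.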

Give one or both of those a try and see if your “Unassigned_Ambiguity” and “Unassigned_MultiMapping” counts drop. They should (“gene_id” and “transcript_id” will no longer be the same value).

You may even get fewer “Unassigned_NoFeatures” if the UCSC data was truncated when extracted from the Table Browser.
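One way to check whether a GTF has the “gene_id” equals “transcript_id” problem mentioned above is to compare the two attributes record by record. The following is a rough sketch, assuming standard GTF attribute syntax (key "value"; pairs in column 9); the helper names and the two example records are hypothetical.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Parse a GTF attribute column into a dict (first value wins for repeated keys)."""
    attrs = {}
    for key, value in ATTR.findall(field):
        attrs.setdefault(key, value)
    return attrs


def same_id_fraction(lines):
    """Fraction of records whose gene_id equals their transcript_id."""
    same = total = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            same += attrs["gene_id"] == attrs["transcript_id"]
    return same / total if total else 0.0


# Made-up records: the first repeats the transcript ID as the gene ID.
records = [
    'chr1\tsrcA\texon\t100\t200\t.\t+\t.\tgene_id "TX1"; transcript_id "TX1";',
    'chr1\tsrcB\texon\t100\t200\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";',
]
print(same_id_fraction(records))
```

A fraction close to 1.0 on a real annotation would suggest that gene-level summarization is effectively working on transcripts.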

srashid · October 31, 2019, 8:10pm

Thank you! While waiting for your reply, I actually tried the iGenomes GTF file. It definitely reduces ambiguity, but the multimapping issue still remains.

This is the output of featureCounts now

Assigned	23399833
Unassigned_Unmapped	0
Unassigned_MappingQuality	0
Unassigned_Chimera	0
Unassigned_FragmentLength	0
Unassigned_Duplicate	0
Unassigned_MultiMapping	37557436
Unassigned_Secondary	0
Unassigned_NonSplit	0
Unassigned_NoFeatures	6348801
Unassigned_Overlapping_Length	0
Unassigned_Ambiguity	259191
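To make the change between the two runs easier to see, the per-category differences can be computed directly. This is just arithmetic on the two summaries posted in this thread; the variable names are illustrative.

```python
# Non-zero counts from the first run (UCSC Table Browser GTF).
ucsc = {
    "Assigned": 7712538,
    "Unassigned_MultiMapping": 37553849,
    "Unassigned_NoFeatures": 4138912,
    "Unassigned_Ambiguity": 18155778,
}

# Non-zero counts from the second run (iGenomes GTF).
igenomes = {
    "Assigned": 23399833,
    "Unassigned_MultiMapping": 37557436,
    "Unassigned_NoFeatures": 6348801,
    "Unassigned_Ambiguity": 259191,
}

# Difference per category: second run minus first run.
deltas = {name: igenomes[name] - ucsc[name] for name in ucsc}

for name, delta in deltas.items():
    print(f"{name:28s}{delta:+,d}")
```

Unassigned_Ambiguity drops by about 17.9 million and Assigned rises by about 15.7 million, while Unassigned_MultiMapping is essentially unchanged and Unassigned_NoFeatures goes up by about 2.2 million.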

jennaj · November 1, 2019, 4:19pm

Hi @srashid

FeatureCounts only reports unique matches with default settings.

Your reads are likely hitting more than one “exon”, which leads to “multimapping” counts when summarized at the Gene level.

Review the “Advanced Options”. In particular, pay attention to these parameters, but also review others and see what results. There isn’t a single right answer for everyone. It depends on how you want these counted up, if at all.

“Allow read to contribute to multiple features” (default=no)
“Largest overlap” (default=no)
“Count multi-mapping reads/fragments” (default=disabled) and the sub-option (when enabled) “Assign fractions to multimapping reads”

Thanks!
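For anyone running featureCounts on the command line rather than in Galaxy, the options above appear to correspond to the flags -O, --largestOverlap, -M and --fraction. That mapping between the Galaxy form labels and the flags is an assumption here, as are the file names; the sketch only builds the argument list and does not run anything. Note that featureCounts decides which reads are multi-mapping from the NH tag in the BAM, and that newer Subread releases may need additional paired-end options.

```python
def build_featurecounts_cmd(bam, gtf, out, multi_overlap=False,
                            largest_overlap=False, count_multimapping=False,
                            fraction=False):
    """Build a featureCounts argument list (hypothetical helper, nothing is executed)."""
    # -p: paired-end input, -s 2: reverse-stranded library
    cmd = ["featureCounts", "-a", gtf, "-o", out, "-p", "-s", "2"]
    if multi_overlap:
        cmd.append("-O")  # let a read count toward every feature it overlaps
    if largest_overlap:
        cmd.append("--largestOverlap")  # assign to the feature with the largest overlap
    if count_multimapping:
        cmd.append("-M")  # also count reads flagged as multi-mapping (NH tag > 1)
        if fraction:
            cmd.append("--fraction")  # give fractional counts instead of full counts
    cmd.append(bam)
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_featurecounts_cmd("aligned.bam", "genes.gtf", "counts.txt",
                              count_multimapping=True, fraction=True)
print(" ".join(cmd))
```

Whether to enable any of these depends on how the counts will be used downstream, as noted above.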

dartagnan32 · March 24, 2021, 1:31pm

Did you resolve your issue? It would be interesting to know what the solution was.

mmomeni · June 11, 2021, 12:30pm

Hi.
I have the same problem: a lot of unassigned multimapped reads. When I select the “Allow reads to map to multiple features” option, my problem is fixed and more than 50% of reads are assigned. Now I want to ask: is it scientifically and technically okay to allow reads to map to multiple features? (My aim in this analysis is to find DEGs.)

David · June 11, 2021, 2:41pm

@mmomeni, you can find some discussions/methods here:

https://www.biostars.org/p/273609/

https://doi.org/10.1016/j.csbj.2020.06.014

https://www.biostars.org/p/311322/

mmomeni · June 13, 2021, 12:29pm

Thanks for pointing me to these discussions. I read them.
Can we use one of the tools Salmon, RSEM, or Kallisto in Galaxy for dealing with multi-mapped reads?
If the answer is yes, do any tutorials exist for that? If Galaxy has any other tool for this purpose, please point me to it.

gallardoalba · June 14, 2021, 7:02pm

Hi @mmomeni,
Yes, Kallisto, RSEM and Salmon are available in Galaxy. I recommend you have a look at this tutorial to learn how to use Salmon for gene quantification: Quantification of gene expression: Salmon.

Regards

mmomeni · June 15, 2021, 4:17am

Thanks a lot. That was an interesting tutorial and an interesting tool!
