Too many duplicated sequences OR unassigned because of mapping quality

Hello,

I have problems at the very beginning of my RNA sequence analysis, and it would be great if somebody could help.

After importing the sequence data, I ran Cutadapt. The result is a duplication rate of 88%. Here is a picture of the Sequence Duplication Levels:
[Image: duplicates]

Then I ran featureCounts and got this:

[Image: mapping quality]

I can’t find an explanation for the high rate of reads left unassigned because of multi-mapping.

After counting, annotation, and so on, I figured out that roughly 35% of my RNA is ribosomal RNA.

Does anybody have experience with this who could help?


Welcome, @MelanieP

Multi-mapping is usually a problem with the annotation used during analysis, not the reads.

The ribosomal RNA content was a problem introduced during library construction, prior to sequencing. Those reads will probably fall out during the mapping steps of a standard transcriptomics analysis protocol.

More help for the reference annotation part, and potentially analysis methods →

For what to do about the ribosomal RNA beyond attempting to remove it during mapping, review scientific forums, publications, and related resources. You may have to decide whether the reads are usable at all.


I would like to say a bit more about choosing the annotation and about unusual sources of multi-mapping, because I investigated this particular analysis. As @jennaj mentioned, the annotation can influence the featureCounts results.

  • If you have a decent number of uniquely mapped reads (in the RNA STAR log) but many Unassigned_NoFeature reads in the featureCounts output, try using a different annotation file. There is a good chance that the reads are from ribosomal RNAs (rRNA) but the GTF file does not have a complete annotation of rRNAs. Sometimes the RefSeq annotation covers rRNAs better than the Ensembl or GENCODE annotations.

  • In RNA-seq, you can expect ~50-60% duplication because many reads come from a small number of highly expressed genes. In your samples, the abnormally high duplication level is most likely due to the high percentage of rRNA.

  • There is a relationship between multi-mapped reads and their mapping quality. Alignment programs assign a low mapping quality (the MAPQ field in BAM files) to multi-mapped reads and a high value to uniquely mapped reads. The exact values depend on the aligner used. By default, RNA STAR uses a MAPQ value of 255 to indicate a uniquely mapped read, and a MAPQ of 3 to indicate a read multi-mapped to two genomic loci. The value decreases further as the number of hits on the reference genome increases. By default, in featureCounts, the Minimum mapping quality per read parameter is set to 0. Hence, every multi-mapped alignment is counted as Unassigned_MultiMapping. If you set this parameter to 10, all alignments with MAPQ<10 are reported as Unassigned_MappingQuality in the featureCounts output. So, depending on the value set for the Minimum mapping quality per read parameter, you see them as either multi-mapped or low-quality alignments, but they are actually the same set of multi-mapped reads.

  • Now the question is: what causes multi-mapped reads, and can we do something about it?

    • The usual cause is reads mapping to homologous genes or repeats. These reads are generally excluded from the analysis after the counting step. This is a common approach for gene expression analysis. There are other methods that quantify isoform expression and include multi-mapped reads.
    • In some rare cases, if reads from your top N expressed genes multi-map to a normal chromosome as well as an alt chromosome, their MAPQ is set low and they are eventually discarded as multi-mapped reads in the featureCounts output. Usually, in RNA-seq, it is not essential to consider such alternative chromosomes. If you see such a high percentage of multi-mapped reads, try mapping your sample to a custom reference genome without the alt chromosomes. For example, the primary assembly FASTA files from GENCODE can be used as a reference genome. Because the alt chromosomes are absent, reads that previously multi-mapped to them will map uniquely to the normal chromosomes.
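
To make the MAPQ point above more concrete, here is a minimal Python sketch of how the same multi-mapped reads can move between the two "Unassigned" categories. It assumes RNA STAR's default MAPQ scheme (255 for a unique hit, 3 for two loci, 1 for three or four, 0 for five or more) and a simplified order of checks (mapping quality first, then multi-mapping) that matches the behavior described above. The function names and the `Passes_to_feature_overlap` label are hypothetical, made up for illustration; this is not featureCounts code.

```python
from collections import Counter


def star_mapq(n_loci: int) -> int:
    """MAPQ under RNA STAR's default scheme for a read aligned to n_loci loci."""
    if n_loci < 1:
        raise ValueError("a mapped read must align to at least one locus")
    if n_loci == 1:
        return 255  # uniquely mapped
    if n_loci == 2:
        return 3
    if n_loci <= 4:
        return 1
    return 0  # five or more loci


def classify(n_loci: int, min_mapq: int = 0) -> str:
    """Simplified model (an assumption): quality filter first, then multi-mapping."""
    if star_mapq(n_loci) < min_mapq:
        return "Unassigned_MappingQuality"
    if n_loci > 1:
        return "Unassigned_MultiMapping"
    # Hypothetical label: unique reads go on to be matched against features.
    return "Passes_to_feature_overlap"


def summarize(loci_per_read, min_mapq: int = 0) -> Counter:
    """Count reads per category for a list of per-read locus counts."""
    return Counter(classify(n, min_mapq) for n in loci_per_read)


if __name__ == "__main__":
    reads = [1, 1, 2, 3, 7]  # toy data: number of loci each read aligned to
    print(summarize(reads, min_mapq=0))
    print(summarize(reads, min_mapq=10))
```

With the threshold at 0, the three multi-mapped toy reads land in Unassigned_MultiMapping; with the threshold at 10, the same three reads land in Unassigned_MappingQuality, and the two unique reads are unaffected in both cases.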