After running RNA STAR on 6 rat RNA-seq samples, only about 50-60% of reads mapped uniquely, with 30-40% of reads mapping to multiple loci. Is there a way to improve mapping quality, or to determine what sort of contamination there might be?
Is it possible that the issue comes from the gene annotation file? My first attempt at mapping used a .gtf annotation file from Ensembl, but it didn't work. Instead, the one I got from UCSC Main worked. My hunch is that it was just a formatting issue.
Ensembl source - https://useast.ensembl.org/Rattus_norvegicus/Info/Index
UCSC source - UCSC Genome Browser Downloads
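One quick way to test that hunch, if useful: Ensembl and UCSC name chromosomes differently ("1" vs. "chr1"), so a GTF only works if its chromosome names match the genome build the STAR index was made from. A rough check, with file names as placeholders:

```bash
# Chromosome names in the genome FASTA used for the STAR index
grep '^>' genome.fa | cut -d ' ' -f1 | tr -d '>' | sort -u | head

# Chromosome names used in the annotation (skipping header lines)
grep -v '^#' annotation.gtf | cut -f1 | sort -u | head
```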
Here is the RNA STAR log for one of the samples.
Started job on | Jan 29 08:14:51
Started mapping on | Jan 29 08:36:14
Finished on | Jan 29 12:26:14
Mapping speed, Million of reads per hour | 6.18
Number of input reads | 23700301
Average input read length | 74
UNIQUE READS:
Uniquely mapped reads number | 13214262
Uniquely mapped reads % | 55.76%
Average mapped length | 73.63
Number of splices: Total | 2475399
Number of splices: Annotated (sjdb) | 2413443
Number of splices: GT/AG | 2408845
Number of splices: GC/AG | 22608
Number of splices: AT/AC | 3324
Number of splices: Non-canonical | 40622
Mismatch rate per base, % | 0.71%
Deletion rate per base | 0.02%
Deletion average length | 1.29
Insertion rate per base | 0.02%
Insertion average length | 1.43
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 9233338
% of reads mapped to multiple loci | 38.96%
Number of reads mapped to too many loci | 146819
% of reads mapped to too many loci | 0.62%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 1075213
% of reads unmapped: too short | 4.54%
Number of reads unmapped: other | 30669
% of reads unmapped: other | 0.13%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Dear @ryzhu,
Check with FastQC whether there is any adapter contamination still in your data. Furthermore, it might be worth extracting your multi-mapped reads and using featureCounts or BLAST to see if they come from rRNA (using a GFF/GTF for featureCounts, or an rRNA sequence file for BLAST). A high ratio of multi-mapped reads in RNA-seq is often a sign of incomplete rRNA depletion.
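For example, something roughly along these lines might work; file names are placeholders, and this assumes an Ensembl-style GTF, where rRNA genes are labelled via the gene_biotype attribute (Mt_rRNA covers the mitochondrial rRNAs):

```bash
# Keep only rRNA features from an Ensembl-style GTF
grep -E 'gene_biotype "(rRNA|Mt_rRNA)"' annotation.gtf > rRNA.gtf

# Count reads over those features, including multi-mapped reads (-M)
featureCounts -M -a rRNA.gtf -o rRNA_counts.txt sample.bam
```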
Thanks for the reply. I was able to get rid of the adapter contamination using CutAdapt.
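(For reference, the trimming step was roughly of this form; the adapter sequence and file names are placeholders, and single-end reads are assumed.)

```bash
# Remove a 3' adapter from single-end reads (adapter sequence is a placeholder)
cutadapt -a AGATCGGAAGAGC -o sample_trimmed.fastq.gz sample.fastq.gz
```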
After running FastQC on the trimmed data, I used BLAST on the reads from the "Overrepresented sequences" section of the report, and most of them came from rRNA.
I then used Samtools view to filter the mapped data for reads with the "alignment of the read is not primary" flag set. I used BLAST on several of these sequences but did not find them to be from rRNA. Is there a higher-throughput way to check whether the reads come from rRNA than BLASTing them one by one?
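(For reference, the filtering step looked roughly like this; 256 / 0x100 is the SAM flag for "alignment is not primary", and file names are placeholders.)

```bash
# Keep only the secondary ("not primary") alignments
samtools view -b -f 256 mapped.bam > secondary.bam
```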
Given that many of the "Overrepresented sequences" came from rRNA, I feel like incomplete depletion of rRNA is part of the issue.
Proceeding from here, my next step is to use featureCounts. Should I re-run RNA STAR with different mapping settings, or proceed with featureCounts and just disregard the multi-mapped reads? Thank you so much.
Am I correct in thinking that my goal here is to remove the multi-mapped reads that come from rRNA?
I have been able to extract the multi-mapped reads from my BAM file and convert the sequences into FASTA format. I am not sure how BLAST would be helpful, as I don't think its output would let me remove the multi-mapped reads that come from rRNA. Also, is there a rat-specific rRNA database I could use with BLAST?
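(The BAM-to-FASTA conversion was roughly of this form; file names are placeholders.)

```bash
# Convert the extracted multi-mapped reads to FASTA for BLAST
samtools fasta multimapped.bam > multimapped.fa
```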
Dear @ryzhu,
Yes, you might need to remove the multi-mapped reads that map to rRNA. I am not familiar with rat. Maybe use Ensembl, pick a rat annotation (GTF/GFF3), and filter it for rRNAs.
Also, the tool that Björn has suggested might help.
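One way to do that in bulk with command-line BLAST+, rather than one read at a time; file and database names are placeholders, and gffread is just one option (an assumption here) for pulling out the transcript sequences:

```bash
# 1. Keep the rRNA entries from the Ensembl rat GTF
grep -E 'gene_biotype "(rRNA|Mt_rRNA)"' Rattus_norvegicus.gtf > rat_rRNA.gtf

# 2. Extract the rRNA transcript sequences from the genome FASTA
gffread -w rat_rRNA.fa -g Rattus_norvegicus.fa rat_rRNA.gtf

# 3. Build a BLAST database and search all multi-mapped reads in one run
makeblastdb -in rat_rRNA.fa -dbtype nucl -out rat_rRNA_db
blastn -query multimapped.fa -db rat_rRNA_db -outfmt 6 -max_target_seqs 1 -out multimapped_vs_rRNA.tsv
```

Reads with a hit in multimapped_vs_rRNA.tsv are the ones likely coming from rRNA.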
Jumping in here, and I may be out of my lane, but the idea was to try to find out what those multi-mapped reads are. If you have confirmed that they are excess rRNA due to a library prep problem (failed depletion), then that is your answer. Those reads are not useful, and they will fall out of the analysis after the mapping step.
You will probably also lose more during the quantification ("counting") steps. Most reference annotations (GTF, etc.) do not include rRNA. There are some annotation sources (and tools to parse them) that specifically focus on rRNA, as @Flow and @bjoern.gruening have described, but for your case that is a data cleaning/checking step, distinct from the primary analysis.
In short, you'll need to decide whether the data is of sufficient quality to continue using. If you decide to proceed, letting the reads fall out of the primary analysis at the different steps is likely "enough" to exclude them. Or you could specifically remove them as well and compare the two approaches. I suspect the results will closely match unless more is going on.
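(To illustrate the "falling out" at the counting step: featureCounts skips multi-mapped reads by default, so a standard run like the sketch below already excludes them; file names are placeholders.)

```bash
# Default featureCounts run; multi-mapped reads are not counted unless -M is given
featureCounts -a annotation.gtf -o counts.txt sample.bam
```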
Sequencing depth is usually chosen with the expected coverage of known or targeted transcription events in mind. If ~40% of your reads are discarded because of a contamination issue, that is roughly the fraction of potentially informative coverage you have lost. The contaminating reads are not informative, and the reads that might have been generated in their place, to represent events or add confirmation for them, are not available, because they were never sequenced.
Maybe this helps, and others can correct me, of course.