I tried to run RNA STAR to map paired-end reads onto a genome.
I got many unmapped reads described as “too short”, see the report below:
Number of input reads | 35461958
Average input read length | 269
UNIQUE READS:
Uniquely mapped reads number | 12102031
Uniquely mapped reads % | 34.13%
Average mapped length | 274.83
Number of splices: Total | 8397273
Number of splices: Annotated (sjdb) | 8344187
Number of splices: GT/AG | 8236046
Number of splices: GC/AG | 66713
Number of splices: AT/AC | 3518
Number of splices: Non-canonical | 90996
Mismatch rate per base, % | 1.00%
Deletion rate per base | 0.04%
Deletion average length | 2.50
Insertion rate per base | 0.03%
Insertion average length | 2.89
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 5260654
% of reads mapped to multiple loci | 14.83%
Number of reads mapped to too many loci | 3293617
% of reads mapped to too many loci | 9.29%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 41.43%
% of reads unmapped: other | 0.32%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
I also tried to run Hisat2 which indicates that many reads are unpaired:
Do you have any idea what could have caused this?
I was wondering whether perhaps the fact to concatenate the files from different libraries with cat function leads to a problem.
But I have done it in the past and it was never the case.
Or perhaps I should sort the fastq files?
Please let me know if you have any idea how to solve the problem.
The reads should not be too short as they are between 75 and 150 bp.
Dear @Victoria_Moris,
Something might be troublesome with your input or the parameter setting of the mappers. You can follow this post about the issue, when you face too many short reads in RNA STAR. Too short in the context of RNA STAR does not mean the input read is too short. It means the alignment of RNA STAR is just too short and thus the read is filtered out. So optimizing the parameters mentioned in the post might help.
Regarding HISAT2, maybe you should check if your R1 and R2 file really have the same order of read ids. I think you do not have to sort the fastq files, butperhaps you did a mistake in the concatenation of the two files.
Thank you for your reply!
I already saw this post and tried RNA STAR with these parameters: --outFilterScoreMinOverLread --outFilterMatchNminOverLread set at 0.3 and 0 instead of 0.6.
When they were set at 0 I got these results:
Number of input reads | 35461958
Average input read length | 269
UNIQUE READS:
Uniquely mapped reads number | 19338611
Uniquely mapped reads % | 54.53%
Average mapped length | 197.96
Number of splices: Total | 8416774
Number of splices: Annotated (sjdb) | 7602815
Number of splices: GT/AG | 8327063
Number of splices: GC/AG | 20483
Number of splices: AT/AC | 2992
Number of splices: Non-canonical | 66236
Mismatch rate per base, % | 1.62%
Deletion rate per base | 0.04%
Deletion average length | 2.66
Insertion rate per base | 0.04%
Insertion average length | 3.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 12492303
% of reads mapped to multiple loci | 35.23%
Number of reads mapped to too many loci | 3597444
% of reads mapped to too many loci | 10.14%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 0
% of reads unmapped: too short | 0.00%
Number of reads unmapped: other | 33600
% of reads unmapped: other | 0.09%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
There are still many reads which do not map.
and If I put the value of the filters at 0.3 :
Number of input reads | 35461958
Average input read length | 269
UNIQUE READS:
Uniquely mapped reads number | 13796944
Uniquely mapped reads % | 38.91%
Average mapped length | 258.88
Number of splices: Total | 8415901
Number of splices: Annotated (sjdb) | 7602278
Number of splices: GT/AG | 8326287
Number of splices: GC/AG | 20475
Number of splices: AT/AC | 2989
Number of splices: Non-canonical | 66150
Mismatch rate per base, % | 1.17%
Deletion rate per base | 0.04%
Deletion average length | 2.66
Insertion rate per base | 0.04%
Insertion average length | 3.12
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7943871
% of reads mapped to multiple loci | 22.40%
Number of reads mapped to too many loci | 3355078
% of reads mapped to too many loci | 9.46%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 10332465
% of reads unmapped: too short | 29.14%
Number of reads unmapped: other | 33600
% of reads unmapped: other | 0.09%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
I do not think that is the main issue because of the report given by Hisat2.
I think indeed something happened perhaps when I concatenate the two files.
I will try to repeat this step or perhaps map one file from a single library.
Thank you for your help.
Once I have the results, I will indicate here if it solved the problem or not.
Hi Victoria,
Parameters starting with --outFilter do not influence the mapping process but only affect what you see in the output file. So in the end, by decreasing the values for these parameters, you are checking whether they are at least some very short alignments. Based on your read lengths, smaller values make only sense if your sequencing library mostly contains small RNAs that are shorter than 50 bases. With your adjusted parameters (value of 0) you could align 20% more reads uniquely which may indicate that you have short RNA fragments for some reason.
The next step would be to investigate what those unmapped reads are. You can use “Filter sequences by mapping” tool to output just the unmapped reads. This gives a FASTQ file. Then convert this into a FASTA file and do a random subsampling of 1000 reads. In the end, use blastn. Maybe it will lead to some clue.
I ran RNA STAR and Hisat2 with the files before I concatenate them and have even worst results:
Number of input reads | 9136914
Average input read length | 263
UNIQUE READS:
Uniquely mapped reads number | 1816200
Uniquely mapped reads % | 19.88%
Average mapped length | 279.60
Number of splices: Total | 948995
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 937368
Number of splices: GC/AG | 2143
Number of splices: AT/AC | 154
Number of splices: Non-canonical | 9330
Mismatch rate per base, % | 1.39%
Deletion rate per base | 0.04%
Deletion average length | 2.45
Insertion rate per base | 0.03%
Insertion average length | 2.74
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1073629
% of reads mapped to multiple loci | 11.75%
Number of reads mapped to too many loci | 381050
% of reads mapped to too many loci | 4.17%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 5858670
% of reads unmapped: too short | 64.12%
Number of reads unmapped: other | 7365
% of reads unmapped: other | 0.08%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
So I guess the problem does not come from the concatenation.
As I also wrote below playing with the options mentionned in the post you suggested does not solve the problem neither.
I currently trying what Pavan suggested.
Thank you for your help!
Kind regards,
Dear @Victoria_Moris,
Yeah, I would suggest you check the recommendations of Pavan. You can also quickly use FASTQC to see if your reads are really between 75 and 150 bp.
Thank you for the excellent suggestion.
Sorry it took some time to get the results from the blast search.
Indeed it seems that the unmapped reads are contaminant: mitochondrial, rRNA, and human DNA.
I guess the human contamination cames from the preparation of sampes?
It is the first time I have that though.
It cannot come from the sequencing step?
Now that I know that, do you think it is still make my analyses (Differential gene expression analysis)?
Is it better to remove the contaminants before the mapping?
Or is it fine to use the bam files like this since during the mapping these reads are excluded?
Please let me know if it is better to run some additional analyses.
Thank you again for your help!
Best wishes,
Hi Victoria,
At the moment I would not proceed to the DE analysis because you used very tolerant mapping parameters to achieve 54% uniquely mapped reads. Maybe some portion of mapped reads is from homologous genes between Homo sapiens and Platynereis dumerilii (the organism that you are studying?), if there are any (I have no idea if they share any).
It is better to find whether the major portion of the reads is coming from the Platynereis dumerilii or from human. To find out this, try mapping all your reads with the default STAR parameter to the human genome to check how many reads can be mapped. If significant portions of the reads are from human and if you have to use this sequencing data then you have to filter out the reads that are mapped to human before proceeding to counting and DE analysis.
I am running that now.
If there is a high proportion how can I filter the fastq files and remove the reads from human?
I will put the RNA star log file here as soon as the run is over.
Thank you for your help.
Best,
Here is the log from RNA STAR while mapping the reads onto human genome:
Number of input reads | 35461958
Average input read length | 269
UNIQUE READS:
Uniquely mapped reads number | 764850
Uniquely mapped reads % | 2.16%
Average mapped length | 243.28
Number of splices: Total | 6051
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 1857
Number of splices: GC/AG | 321
Number of splices: AT/AC | 9
Number of splices: Non-canonical | 3864
Mismatch rate per base, % | 2.12%
Deletion rate per base | 0.02%
Deletion average length | 2.12
Insertion rate per base | 0.01%
Insertion average length | 2.10
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1298203
% of reads mapped to multiple loci | 3.66%
Number of reads mapped to too many loci | 1549
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 33391189
% of reads unmapped: too short | 94.16%
Number of reads unmapped: other | 6167
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Do you think it is ok to run the mapping of Platynereis reads with default parameters or better to remove those reads which map onto human genome?
Thank you again for your help.
Best,
It’s strange that BLAST found human contamination but you cannot align many reads to the human genome.
Yes, you can map against Platynereis with default parameters and continue with DE analysis. When you use default parameters, you don’t have to remove the reads that map to the human genome.
A lot of unmapped reads were also identified as mitochondrial or rRNA.
It does not make sense then to decrease the two parameters mentionned above to 0.3 instead of 0.6?
Do you have an idea why so many reads were identified in Hisat2 as unpaired?
Thank you again for your help!