Another quality control question.
I trimmed my datasets of interest with Trim Galore! and then ran FastQC to check whether the adapter and primer sequences had been removed. Based on the Adapter Content plot, the adapters were removed successfully during trimming, but some datasets still had overrepresented sequences:
However, since there were no hits to indicate what they were, I copied them into a FASTA file and ran them with the NCBI BLAST+ blastn tool against my locally installed BLAST databases.
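A minimal sketch of how such a FASTA file could be built from FastQC's text report instead of copying by hand (not necessarily the exact steps used here). It assumes the standard `fastqc_data.txt` layout, in which the module starts with `>>Overrepresented sequences`, has a `#`-prefixed header row, and ends with `>>END_MODULE`. The function name, file names, and `overrep_N` record IDs are hypothetical.

```python
from pathlib import Path


def overrepresented_to_fasta(fastqc_data_path, fasta_path):
    """Copy the 'Overrepresented sequences' module of fastqc_data.txt into a FASTA file.

    Returns a list of (sequence, count, percentage) tuples in report order.
    """
    records = []
    in_module = False
    for line in Path(fastqc_data_path).read_text().splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip the '#Sequence Count Percentage Possible Source' header row.
        if in_module and line and not line.startswith("#"):
            seq, count, pct = line.split("\t")[:3]
            records.append((seq, int(count), float(pct)))

    with open(fasta_path, "w") as out:
        for i, (seq, count, pct) in enumerate(records, start=1):
            # Hypothetical ID scheme; count and percentage go in the description.
            out.write(f">overrep_{i} count={count} pct={pct:.2f}\n{seq}\n")
    return records
```

The resulting file can then be searched with something like `blastn -query overrepresented.fasta -db <local_db> -outfmt 6 -out hits.tsv`, where `<local_db>` stands for whichever local database is installed.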
The blastn output was the following:
The sequences matched whole-genome sequencing contigs from multiple species (predominantly mammals). What I am wondering now is how to interpret this: do these overrepresented sequences map to regions that are conserved across species, making them false positives (in the sense that they are not contaminating adapters, primers, etc.), or is it something else?
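To make "matched across multiple species" easier to judge, the hits can be summarised per query. The sketch below assumes blastn was run with `-outfmt 6` and the default 12 tab-separated columns (qseqid, sseqid, pident, length, ..., evalue, bitscore). The output shown above may have been in a different format, and the function name is hypothetical.

```python
import csv
from collections import defaultdict


def summarise_hits(tsv_path):
    """Per query: number of distinct subject sequences hit, plus the best hit by bit score."""
    subjects = defaultdict(set)
    best = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comment lines
            qseqid, sseqid = row[0], row[1]
            pident, length, bitscore = float(row[2]), int(row[3]), float(row[11])
            subjects[qseqid].add(sseqid)
            if qseqid not in best or bitscore > best[qseqid]["bitscore"]:
                best[qseqid] = {
                    "sseqid": sseqid,
                    "pident": pident,
                    "length": length,
                    "bitscore": bitscore,
                }
    return {q: {"n_subjects": len(subjects[q]), **best[q]} for q in subjects}
```

A query whose best hit covers the full ~76 bp at near 100% identity and recurs across many subjects would be consistent with a conserved or highly expressed transcript (rRNA, for instance), though the subject annotations still need to be checked by hand.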
If they match innocuous sequences, can I ignore the overrepresented sequences when running downstream analyses?
The reads I am investigating are short (~76 bp), so I would assume they have a higher chance of matching something broadly than longer reads would. They are also from RNA-seq experiments (I read that sequences flagged as overrepresented can sometimes indicate highly expressed genes).
Would appreciate your opinions and advice.
Thanks!