BLAST: unidentified overrepresented sequences broadly match whole genome sequencing contigs

Another quality control question.

I trimmed my datasets of interest with Trim Galore! and then ran FastQC to check whether the removal of adapter and primer sequences was successful. Based on the Adapter Content plot, the adapters were removed successfully during trimming, but I still had some overrepresented sequences in some datasets:

However, since FastQC reported no hits to indicate what these were, I copied them into a FASTA file and ran them with the NCBI BLAST+ blastn tool against the locally installed BLAST databases.
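For anyone who wants to script the copy-into-FASTA step, here is a minimal sketch. It assumes the unzipped FastQC report contains a `fastqc_data.txt` file in which the `>>Overrepresented sequences` module is tab-separated and closed by `>>END_MODULE`. The function name and the `overrep_N` record IDs are hypothetical placeholders, not something FastQC produces.

```python
def overrepresented_to_fasta(lines):
    """Convert the 'Overrepresented sequences' module of fastqc_data.txt to FASTA text.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Record IDs (overrep_1, overrep_2, ...) are made up for illustration.
    """
    records = []
    in_module = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">>Overrepresented sequences"):
            in_module = True          # module header line, data follows
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break                     # end of the module we care about
        if not in_module or not line or line.startswith("#"):
            continue                  # other modules, blank lines, column header
        fields = line.split("\t")     # Sequence, Count, Percentage, Possible Source
        seq, count = fields[0], fields[1]
        records.append(f">overrep_{len(records) + 1} count={count}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")


# Example use (file names are placeholders):
# with open("fastqc_data.txt") as handle, open("overrepresented.fasta", "w") as out:
#     out.write(overrepresented_to_fasta(handle))
```

The resulting file can then be passed to blastn as the query, for example with `-query overrepresented.fasta -outfmt 6` plus whichever local database is in use.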

The output of this was the following:

The reads matched whole genome sequencing contigs across multiple species (predominantly mammals). What I am wondering now is how to interpret this: are these overrepresented sequences mapping to sequences that are potentially conserved across species, making them false positives (in the sense of not being a contaminating adapter, primer, etc.), or is it something else?

If they match innocent sequences, can I ignore the overrepresented sequences when running downstream analyses?

The reads I am investigating are short (~76 bp), so I would assume they have higher chances of matching something broadly than longer reads. They are also from RNAseq experiments (I read that sequences flagged as overrepresented can sometimes indicate highly expressed genes).
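A rough sketch of how the "short reads match broadly" concern could be checked: keep only BLAST hits that cover most of the query at high identity, and see which subjects remain. It assumes blastn was run with `-outfmt 6` (the default 12 tab-separated columns) and that the query lengths are known. The function name and both thresholds are arbitrary choices for illustration.

```python
from collections import defaultdict


def strong_hits(blast_lines, query_lengths, min_identity=95.0, min_coverage=0.9):
    """Group near-full-length, high-identity hits by query.

    blast_lines   : iterable of BLAST tabular (outfmt 6) lines
    query_lengths : dict mapping query ID -> query length in bp
    Thresholds are illustrative defaults, not recommended values.
    """
    kept = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
        #                   qstart qend sstart send evalue bitscore
        qseqid, sseqid = cols[0], cols[1]
        pident = float(cols[2])
        qstart, qend = int(cols[6]), int(cols[7])
        # Fraction of the query covered by this alignment
        coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            kept[qseqid].append((sseqid, pident, round(coverage, 2)))
    return dict(kept)
```

If most queries keep near-full-length hits in several species, that points toward a conserved biological sequence rather than a chance short match.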

Would appreciate your opinions and advice.


One thing that I usually try is to review regions like that in the UCSC Genome Browser (I’ve also just googled them, tried NCBI, etc.). Sometimes you won’t find out what they are, but you can always check later whether they at least fall out during read mapping against a genome/transcriptome. Public WGS data is much “noisier”, and sometimes the age of the sequence submissions and the sequencing protocols are clues, too.

I tend to BLAT the “mystery sequence” against a well-annotated model organism (sometimes a few, if I don’t get hits with the first), then click into the hit and view it in the genome alignment view. Consider scrolling down and turning on more tracks, choosing whatever seems relevant for making a decision.

I usually start with the Conservation track (under Comparative Genomics) plus everything in the Variation and Repeats section, along with the existing defaults, since all of those combined provide basic annotation markers for the hit region (or regions, which can also be informative).

The rest is a judgement call :slight_smile: Happy hunting!


Oh wow, those are some really informative tools. I had a look, and these sequences hit regions encoding ribosomal mRNAs or other components, overlapping with rhesus and mice. Which makes sense!

Thank you :blush:
