BLAST: unidentified overrepresented sequences broadly match whole genome sequencing contigs

Egle · December 6, 2022, 3:28pm

Another quality control question.

I trimmed my datasets of interest with Trim Galore! After this, I ran FastQC to check if the removal of adapter and primer sequences was successful. While, based on the Adapter Content plot, the adapters were removed successfully during trimming, I still had some overrepresented sequences in some datasets:

However, since there were no hits to indicate what these were, I copied them into a FASTA file and ran them using the NCBI BLAST+ blastn tool against the locally installed BLAST databases.
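For anyone repeating this step, here is a minimal sketch of pulling the overrepresented sequences out of a FastQC text report and turning them into FASTA for blastn. It assumes the tab-separated `fastqc_data.txt` layout (a `>>Overrepresented sequences` module closed by `>>END_MODULE`); the function name `overrepresented_to_fasta`, the `overrep_N` record IDs, and the example rows are made up for illustration.

```python
def overrepresented_to_fasta(fastqc_data_text):
    """Build FASTA text from the 'Overrepresented sequences' module of a fastqc_data.txt report."""
    records = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip everything outside the module, plus the '#Sequence...' header and blank lines.
        if not in_module or line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        sequence, count = fields[0], fields[1]
        source = fields[3] if len(fields) > 3 else "unknown"
        # Hypothetical ID scheme: overrep_1, overrep_2, ... with count and source in the description.
        records.append(f">overrep_{len(records) + 1} count={count} source={source}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up report fragment, only to show the expected input shape.
example_report = (
    ">>Overrepresented sequences\twarn\n"
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "ACGTACGT\t120\t0.5\tNo Hit\n"
    "TTTTGGGG\t80\t0.3\tNo Hit\n"
    ">>END_MODULE\n"
)
fasta_text = overrepresented_to_fasta(example_report)
print(fasta_text)
```

The resulting text can be saved as a FASTA file and used as the blastn query.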

The output of this was the following:

The reads matched whole genome sequencing contigs across multiple species (predominantly mammals). What I am wondering now is how to interpret this: are these overrepresented sequences mapping to sequences that are conserved across species, making them false positives (in the sense of not being a contaminating adapter, primer, etc.), or is it something else?
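One way to get an overview before interpreting the hits is to count, per query, how many distinct subjects were matched and what the best hit was. The sketch below assumes the standard 12-column tabular BLAST output (`-outfmt 6`: qseqid, sseqid, pident, ..., bitscore); the function name `summarize_blast_hits` and the subject IDs in the example are hypothetical.

```python
from collections import defaultdict


def summarize_blast_hits(tabular_text):
    """Per query: number of distinct subjects hit and the best hit by bit score."""
    subjects = defaultdict(set)
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Columns 1, 2, 3 and 12 of the standard layout: qseqid, sseqid, pident, bitscore.
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        subjects[query].add(subject)
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return {q: {"n_subjects": len(subjects[q]), "best_hit": best[q]} for q in subjects}


# Made-up rows, only to show the expected input shape.
example_hits = "\n".join([
    "overrep_1\tcontigA\t100.0\t76\t0\t0\t1\t76\t500\t575\t1e-30\t141.0",
    "overrep_1\tcontigB\t98.7\t76\t1\t0\t1\t76\t90\t165\t5e-29\t135.0",
    "overrep_2\tcontigC\t97.4\t76\t2\t0\t1\t76\t10\t85\t2e-27\t130.0",
])
summary = summarize_blast_hits(example_hits)
for query, info in summary.items():
    print(query, info["n_subjects"], info["best_hit"])
```

A query that hits many subjects with near-identical scores may point to a sequence shared across species rather than a single contaminant, but that still needs to be checked against annotation.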

If they match to innocent sequences, can I ignore the overrepresented sequences when running downstream analyses?

The reads I am investigating are short (~76 bp), so I would assume they have higher chances of matching something broadly than longer reads. They are also from RNAseq experiments (I read that sequences flagged as overrepresented can sometimes indicate highly expressed genes).

Would appreciate your opinions and advice.

Thanks!

jennaj · December 7, 2022, 1:13am

One thing that I usually try is to review regions like that in the UCSC Genome Browser https://genome.ucsc.edu/. (I’ve also just googled them, tried at NCBI, etc.) Sometimes you won’t find out what they are, but you can always check whether they at least fall out during read mapping against a genome/transcriptome later on. Public WGS data is much “noisier”, and sometimes the age of the sequence submissions and the sequencing protocols are clues, too.

I tend to try to BLAT map the “mystery sequence” to a well annotated model organism (sometimes a few, if I don’t get hits with the first), then click into the hit and view it in the genome alignment view. Consider scrolling down and turning on more tracks to suit what seems relevant to make a decision.

I usually start with the Comparative Genomics → Conservation track plus everything in the Variation and Repeats section along with the existing defaults since all those combined provide basic annotation markers for the hit region (or regions – which can also be informative).

The rest is a judgement call. Happy hunting!

Egle · December 7, 2022, 2:43pm

Oh wow, those are some really informative tools. I had a look, and these sequences hit regions encoding ribosomal mRNAs or other components, with overlapping hits in rhesus and mouse. Which makes sense!

Thank you
