I got RNA-seq data from a colleague from a few years ago that they never used. It is interesting for my project, so I wanted to have a look.
These are 7 saliva samples from healthy humans. The library was prepared with the QuantSeq 3’ mRNA-Seq Library Prep Kit FWD with UDI 12 nt Set B1 (Lexogen), which is an mRNA-specific kit.
I had a look at the raw data with FastQC and it looked weird to me. I hoped that trimming with Trimmomatic might fix it, but afterwards it looked the same. Only the N content was fixed.
Can someone please explain to me what is wrong with the data and how this happened?
How do I best clean this?
Here are some pictures:
Not really, I can interpret normal FastQC results with no issues, and I have analyzed several RNA-seq datasets with Galaxy. However, I have never seen this before and have not found an adequate example on the internet.
I don’t understand the weird TA sequence in the per base sequence content, and I don’t understand why I can’t remove this unusually high adapter contamination.
I was hoping for some specific help with what I am seeing!
and that the reads ended up short, with adapter detected at the end. This suggests that the default adapters Trimmomatic uses were not a match, so the trimming failed. You haven’t tried using Cutadapt or fastp yet. Both of those have an optional report that you can send to MultiQC. The example workflow above has an example that you can use as a template.
We can’t offer too much scientific advice at this forum (as @igor was clarifying), but we can help you use the different tools to get all the results you might need for your own scientific review.
What to do:
Start with the raw reads
Determine what the original sequencing protocol was
Apply the correct trimming
Run FastQC again on the result
Review all of the reports together in MultiQC, make inferences, and rerun until the data seems correct. Then try the downstream steps.
NOTE: When mapping to the human genome, you might be able to detect additional issues with the reads, in particular by reviewing the BAM in a genome browser like UCSC.
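For the "determine the protocol" and "apply the correct trimming" steps above, one quick sanity check is to count how many raw reads contain a candidate adapter or a poly-A stretch before choosing what to pass to Cutadapt or fastp. The sketch below is only an illustration in plain Python: the function names are made up for this example, `AGATCGGAAGAGC` is the common Illumina universal adapter prefix, and the 12-base poly-A motif is an assumption about what a 3’ mRNA library might contain. Check the kit documentation for the sequences that actually apply to this library.

```python
import gzip
from itertools import islice

# Candidate motifs to look for. These are assumptions for illustration,
# not the confirmed adapters of the kit discussed in this thread.
CANDIDATES = {
    "illumina_universal": "AGATCGGAAGAGC",  # common Illumina adapter prefix
    "poly_a": "A" * 12,                     # arbitrary poly-A length
}


def read_sequences(path, limit=100_000):
    """Yield up to `limit` read sequences from a FASTQ or FASTQ.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Each FASTQ record is 4 lines; the sequence is the second one.
        for i, line in enumerate(islice(handle, limit * 4)):
            if i % 4 == 1:
                yield line.strip()


def motif_fractions(sequences, motifs=CANDIDATES):
    """Return the fraction of reads that contain each motif."""
    total = 0
    hits = {name: 0 for name in motifs}
    for seq in sequences:
        total += 1
        for name, motif in motifs.items():
            if motif in seq:
                hits[name] += 1
    if total == 0:
        return {name: 0.0 for name in motifs}
    return {name: hits[name] / total for name in motifs}
```

Something like `motif_fractions(read_sequences("sample.fastq.gz"))` (hypothetical file name) on the raw reads, and again on the trimmed reads, would show whether the trimming step is actually removing the sequence that FastQC flags.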
Hi @Sabsida,
As @jennaj suggested, check the reads in the BAM file and maybe start with the raw data. For example, check the CIGAR strings and see whether the 5’ ends are soft-clipped. I am not familiar with the kit used for the library, but I am curious whether the TA spike can be linked to the UDI. If that is indeed the case, the 5’ ends will be soft-clipped. If the UDI is still present, demultiplex the data.
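A rough way to do the CIGAR check without a genome browser is to tabulate the soft-clip length at the 5’ end of each aligned read. This is a minimal sketch in plain Python that works on SAM text lines. The function names are invented for this example, and it assumes single-end reads, where the read’s 5’ end is the left side of the CIGAR for forward-strand alignments and the right side for reverse-strand alignments (flag bit 16).

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_softclip(flag, cigar):
    """Return the soft-clip length at the read's 5' end (0 if none)."""
    if cigar == "*" or flag & 4:  # unmapped or no CIGAR available
        return 0
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if flag & 16:  # reverse strand: the read's 5' end is the CIGAR's right end
        ops.reverse()
    for length, op in ops:
        if op == "H":  # hard clips sit outside soft clips; skip them
            continue
        return length if op == "S" else 0
    return 0


def softclip_histogram(sam_lines):
    """Count reads by 5' soft-clip length, skipping SAM header lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        counts[five_prime_softclip(int(fields[1]), fields[5])] += 1
    return counts
```

The input could be the text output of `samtools view` on a subset of the BAM. If most reads show the same non-zero clip length (for example 12, matching a 12 nt UDI), that would support the idea that a fixed-length sequence is still attached to the 5’ end.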
Kind regards,
Igor