What's the dip at the 50th percentile of gene body coverage by RSeQC for WGS results?

wgs
mapping
rseqc
coverage
qa-qc

#1

Hello,
I used NGS: RSeQC - Gene Body Coverage (BAM), "Read coverage over gene body" (Galaxy Version 2.6.4.2). Input: BAM; reference gene model: BED12.
The input is the BAM result of a WGS 2x75 run on a NextSeq; the reference was hg38.
For lib#261, I got this significant drop at the 50th percentile.
Can anyone explain it to me?
Or is there a better tool for checking coverage uniformity?
Thanks
Larry


#2

Thanks for clarifying the usage.

X-axis = Gene Body Percentile (5’ > 3’) – every transcript is normalized to a length of 100
Y-axis = Coverage – read coverage counts per transcript region

This is a graph of the RSeQC txt output, an alternative view of the PDF output produced by RSeQC (where the Y-axis is % coverage). This type of graph can be generated by MultiQC or other plotting methods.

What this means: The WGS reads are not mapping well to the middle of transcripts. There is 3’/5’ bias, with some type of problem at the very end of the 3’ UTR (sequencing artifact? library construction bias?).

Some things to check:

  • The sample size is pretty small and might be biased. Try checking a random sample (or all) of the properly paired reads with a MAPQ of 30 or higher.

  • Does the BED12 represent a UCSC “Genes and Gene Predictions” track that is complete (full gene)? RefSeq and Ensembl are good choices. Avoid GenBank’s “All mRNA” and other types of fragmented/high duplication tracks.

  • WGS data needs to be mapped with an unspliced mapping tool. Choices can include BWA/BWA-MEM and Bowtie2. Avoid spliced mapping tools – those are for spliced data (e.g. RNA-seq).

  • You might want to run FastQC on the original fastqsanger datasets to find out about artifacts and other sequence problems that may be present.
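
As a rough illustration of the first check above, here is a minimal Python sketch that keeps only properly paired reads at or above a MAPQ cutoff. It works on plain SAM text lines so it has no dependencies; the function names `keep_read` and `filter_sam` are made up for this example, and treating "MapQ of 30" as "30 or higher" is an assumption.

```python
def keep_read(sam_line, min_mapq=30):
    """Return True if the alignment is flagged as properly paired
    (SAM flag bit 0x2) and its MAPQ is at least min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2 of a SAM record: bitwise FLAG
    mapq = int(fields[4])   # column 5 of a SAM record: MAPQ
    properly_paired = bool(flag & 0x2)
    return properly_paired and mapq >= min_mapq


def filter_sam(lines, min_mapq=30):
    """Yield header lines unchanged and only the alignments that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line
```

On the command line, the same two criteria correspond to `samtools view -b -f 2 -q 30`. Filtering tools on the Galaxy side expose equivalent options.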

Hope that helps!


#3

Thanks jennaj for the explanation.
For whole genome sequencing data, as is the case here, what is segmented into 100 sections by RSeQC?
Thanks
Larry


#4

These are the transcripts defined by the BED12 dataset.

Please see: http://rseqc.sourceforge.net/#genebody-coverage-py
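
To make the "100 sections" idea concrete, here is a small Python sketch of how one BED12 transcript can be reduced to 100 evenly spaced exonic positions, ordered 5’ to 3’. This is an illustration under stated assumptions, not RSeQC's actual code: the function names are invented, introns are dropped by joining the BED12 blocks, and transcripts shorter than 100 bases are skipped.

```python
def exon_bases(bed12_line):
    """Return the genomic positions (0-based) of all exonic bases of one
    BED12 record, ordered 5' to 3' with respect to the transcript strand."""
    f = bed12_line.rstrip("\n").split("\t")
    tx_start = int(f[1])
    strand = f[5]
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]   # blockSizes
    starts = [int(x) for x in f[11].rstrip(",").split(",")]  # blockStarts
    bases = []
    for size, rel_start in zip(sizes, starts):
        begin = tx_start + rel_start
        bases.extend(range(begin, begin + size))
    if strand == "-":
        bases.reverse()  # the 5' end of a minus-strand transcript is its highest coordinate
    return bases


def percentile_positions(bases, n=100):
    """Pick n evenly spaced positions along the transcript.
    Assumption of this sketch: transcripts shorter than n bases return []."""
    if len(bases) < n:
        return []
    return [bases[(i * len(bases)) // n] for i in range(n)]
```

Coverage at position k of each transcript is then summed over all transcripts to give the value plotted at percentile k. For WGS data this means the plot only describes the genomic regions that the BED12 transcripts cover.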

For WGS data, comparing this result to the complete genome coverage (not just transcript regions) can be informative. See the tool group BEDTools. Example tools of interest: Genome Coverage, MakeWindowsBed. Tool manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html (explains command-line options that are mirrored on the Galaxy tool forms)
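
As a rough sketch of what a windowed genome-wide check computes, the Python below averages depth over fixed-size windows. It assumes the coverage is already available as bedGraph-style records (chrom, start, end, depth), which is one of the formats a genome coverage tool can report; the function name and the window size are illustrative only.

```python
from collections import defaultdict


def mean_depth_per_window(bedgraph_records, window=1000):
    """Average depth in fixed-size windows.

    bedgraph_records: iterable of (chrom, start, end, depth) with 0-based,
    half-open intervals. Returns {(chrom, window_start): mean_depth}.
    Note: the last window of a chromosome is usually shorter than `window`,
    so dividing by `window` underestimates its mean in this sketch.
    """
    totals = defaultdict(int)
    for chrom, start, end, depth in bedgraph_records:
        pos = start
        while pos < end:
            w = pos // window                    # index of the window containing pos
            w_end = min(end, (w + 1) * window)   # stop at the interval end or window edge
            totals[(chrom, w * window)] += (w_end - pos) * depth
            pos = w_end
    return {key: total / window for key, total in totals.items()}
```

Windows whose mean depth is far from the genome-wide average point to uneven coverage that is independent of any gene model.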