Variant Calling from RNAseq

As mentioned, my question has to do with variant calling using RNAseq data. My workflow is basically: Trimmomatic → HISAT2 → FreeBayes → VCF filter → snpeff eff

My question is about the results of snpeff (see below). How/why am I detecting so many upstream, downstream, and intergenic variants if my starting point is RNAseq data?

Type (alphabetical order) Count Percent
DOWNSTREAM 5,285 17.28%
EXON 5,148 16.832%
GENE 2 0.007%
INTERGENIC 6,870 22.462%
INTRON 5,317 17.384%
SPLICE_SITE_ACCEPTOR 333 1.089%
SPLICE_SITE_DONOR 360 1.177%
SPLICE_SITE_REGION 1,602 5.238%
TRANSCRIPT 2 0.007%
UPSTREAM 5,218 17.061%
UTR_3_PRIME 300 0.981%
UTR_5_PRIME 148 0.484%

Hi @mitchellgc

Are you using a stranded protocol for all steps?

If using unstranded for any, or just some, that could explain your results. Using the wrong strand could also end up this way.

You can check strand with a tool like Infer Experiment.

Other than that, no data is perfect. And depending on the genome, not all transcribed elements may have an annotated feature yet. You could pre-filter to use only the “known genes” in all steps if you are not interested in discovery.

1 Like