Variant calling with VarScan on WGS data

Aidin · October 6, 2024, 8:57pm

Hello Galaxy,

First of all, thank you for providing such a valuable platform!
I am currently running VarScan on a filtered whole genome sequencing BAM file with 35GB of data. The tool appears to be running without any errors so far. Is it normal for VarScan to take this long when processing a WGS BAM file of this size? Thank you

jennaj · October 7, 2024, 7:08pm

Welcome, @Aidin

Is your tool in the executing phase (yellow)? Or is it still queued (grey)?

More details about how jobs run at public Galaxy servers:

If the job is queued (grey), it might be still waiting for resources to become available, or there could be a problem with the inputs.

So, given that context: check your inputs, and make sure they match the data the tool is expecting to process.

Data in BAM format is not an appropriate direct input for VarScan. This tool expects a pileup dataset instead. You can create a pileup dataset from a BAM dataset, but that involves another tool.

Screenshots

[Screenshot: tool form input area (expand the accepted formats toggle)]

[Screenshot: tool form Help area (scroll down to find this)]

You can find the Samtools mpileup tool at the public Galaxy servers. This is the tool link at the UseGalaxy.org server: https://usegalaxy.org/root?tool_id…samtools_mpileup/2.1.7
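For anyone who wants to picture the same two steps outside of Galaxy, here is a minimal sketch in Python that only builds the two command lines (it does not run them). The file names, the `VarScan.jar` path, and the choice of the `mpileup2snp` subcommand are assumptions for illustration, not details from this thread; check the options against the tool form Help before using them.

```python
import shlex


def build_pileup_then_varscan(reference, bam, pileup, varscan_jar="VarScan.jar"):
    """Return the two command lines (as argv lists) for BAM -> pileup -> VarScan.

    All paths are hypothetical placeholders.
    """
    # Step 1: samtools mpileup reads the BAM plus the reference FASTA
    # and writes a pileup file.
    mpileup_cmd = ["samtools", "mpileup", "-f", reference, "-o", pileup, bam]
    # Step 2: VarScan reads the pileup (not the BAM) and prints calls to
    # standard output; "--output-vcf 1" asks for VCF-formatted output.
    varscan_cmd = ["java", "-jar", varscan_jar, "mpileup2snp", pileup, "--output-vcf", "1"]
    return mpileup_cmd, varscan_cmd


mpileup_cmd, varscan_cmd = build_pileup_then_varscan(
    "reference.fa", "sample.bam", "sample.pileup"
)
print(shlex.join(mpileup_cmd))
print(shlex.join(varscan_cmd))
```

The point of the sketch is only the ordering: the pileup is produced first, and VarScan is pointed at the pileup.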

And, if you are running a workflow, go into your Data → Workflow Invocations to see the details of the progress.

For more help, you can share back your history, along with a tutorial link or a link to where you sourced the workflow. All of this provides extra context. How to share your work is explained in the banner at this forum, or see this topic directly: How to get faster help with your question

The UseGalaxy servers are very busy today due to a training event that is running all this week, so some extra patience might be needed, but everything should in general be working as usual.

Please let us know if this helps or not, or if you have any followup questions!

Aidin · October 7, 2024, 9:58pm

Thank you for your response! @jennaj

The job completed just today, so that’s good news. Moving forward, I will definitely consider running Samtools mpileup before VarScan.

Thank you for your support!

Aidin · October 10, 2024, 9:23am

Hi @jennaj,

Apologies for reaching out again. I have five WGS datasets from five gene-edited replicate samples, and I’m working on assessing the off-target effects of the gene editing machinery. I would greatly appreciate any guidance you could provide on the following points:

1. I’m interested in identifying somatic variants shared across all samples. Is there a way to find the common variants among the five samples using Galaxy with the VCF files and then perform downstream analysis on a single VCF file? I’m currently using VarScan for variant calling.
2. I haven’t performed deduplication on my WGS data, as doing so would result in losing nearly half of the reads. However, I have filtered the data multiple times, left-aligned the reads around indels, and performed recalibration. Given that I want to cross-compare the variants and extract shared SNVs, would skipping deduplication likely result in a significant number of false positives?
3. Lastly, just to confirm, does the GEMINI annotation software support the annotation of intronic variants?

Thank you so much in advance and best wishes!
Aidin

jennaj · October 10, 2024, 5:49pm

Hi @Aidin

So glad all of this helped! My quick thoughts for your question…

For item 1, you can call the variants and produce a single multi-sample merged VCF. An example is in the tutorial "Hands-on: Exome sequencing data analysis for diagnosing a genetic disease" (Variant Analysis topic, section #generating-freebayes-calls).
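To make the "shared across all five samples" idea concrete, here is a minimal sketch in plain Python that treats each VCF as a set of (chromosome, position, REF, ALT) keys and intersects the sets. This is not the merged multi-sample calling from the tutorial; it is a simpler, swapped-in illustration, and the toy records and helper names (`toy_vcf`, `variant_keys`, `shared_variants`) are made up for the example.

```python
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"


def toy_vcf(records):
    """Build a tiny VCF text from (chrom, pos, ref, alt) tuples (invented data)."""
    body = "".join(f"{c}\t{p}\t.\t{r}\t{a}\t.\tPASS\t.\n" for c, p, r, a in records)
    return HEADER + body


def variant_keys(vcf_text):
    """Collect one (chrom, pos, ref, alt) key per ALT allele, skipping header lines."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list ALTs comma-separated
            keys.add((chrom, int(pos), ref, allele))
    return keys


def shared_variants(vcf_texts):
    """Return the keys present in every input VCF."""
    return set.intersection(*(variant_keys(text) for text in vcf_texts))


samples = [
    toy_vcf([("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")]),
    toy_vcf([("chr1", 100, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 7, "G", "A")]),
    toy_vcf([("chr1", 100, "A", "G"), ("chr2", 500, "C", "T,G")]),
    toy_vcf([("chr1", 100, "A", "G"), ("chr2", 500, "C", "T")]),
    toy_vcf([("chr1", 100, "A", "G"), ("chr2", 500, "C", "T"), ("chr1", 250, "T", "C")]),
]
shared = shared_variants(samples)
print(sorted(shared))
```

Exact-key matching like this assumes the variants are represented the same way in every file (for example, indels left-aligned consistently), so it is only a starting point.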

For item 2, the reverse seems more likely, but others can comment. Why? Not deduplicating will result in over-representation per sample. Think of this as excess “noise” in the data, making it more difficult to detect any signal. You could test this: try both ways and see what happens.
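If you do try both ways, one simple way to look at the outcome is to compare the two call sets directly. The sketch below assumes each call set has already been reduced to (chromosome, position, REF, ALT) keys; the keys shown are invented placeholders, not results, and `compare_call_sets` is a hypothetical helper name.

```python
def compare_call_sets(with_duplicates, deduplicated):
    """Split two call sets into calls seen in both runs and calls unique to each."""
    a, b = set(with_duplicates), set(deduplicated)
    return {
        "both": a & b,
        "only_with_duplicates": a - b,
        "only_deduplicated": b - a,
    }


# Invented example keys, only to show the shape of the comparison.
calls_with_duplicates = {("chr1", 100, "A", "G"), ("chr2", 500, "C", "T"), ("chr4", 42, "G", "C")}
calls_deduplicated = {("chr1", 100, "A", "G"), ("chr2", 500, "C", "T"), ("chr5", 9, "T", "A")}

result = compare_call_sets(calls_with_duplicates, calls_deduplicated)
for label, calls in result.items():
    print(label, len(calls))
```

Calls that appear in only one of the two runs are the ones worth inspecting more closely.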

For item 3, if your SNP is annotated in a VCF file, I don’t know of any limitations with GEMINI loading that into the database. It is just an SQL database, and you can construct custom queries. See the guide on the tool form, and experiment?
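As a rough picture of what a custom query for intronic variants could look like, here is a small sketch using Python's built-in `sqlite3` module on an in-memory toy table. The table layout, the gene names, and the `intron_variant` label are assumptions modeled loosely on a variants table; the real column names and impact labels should be checked in the guide on the tool form.

```python
import sqlite3

# In-memory toy database; this is NOT a real GEMINI database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "chrom TEXT, start INTEGER, ref TEXT, alt TEXT, gene TEXT, impact TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 99, "A", "G", "GENE_A", "missense_variant"),
        ("chr2", 499, "C", "T", "GENE_B", "intron_variant"),
        ("chr3", 6, "G", "A", "GENE_C", "synonymous_variant"),
    ],
)

# Custom query: keep only rows whose (assumed) impact label marks them as intronic.
rows = conn.execute(
    "SELECT chrom, start, ref, alt, gene FROM variants "
    "WHERE impact = ? ORDER BY chrom, start",
    ("intron_variant",),
).fetchall()
print(rows)
conn.close()
```

The same kind of SELECT statement, adapted to the actual column names, is what you would experiment with in the GEMINI query tool.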

I don’t want to offer too much scientific advice here, since it will start to get out of scope! A scientific forum where others that are doing what you are doing would be better. Just know that anything you can do outside of Galaxy can probably be done in Galaxy, too. If you find a use-case where you have trouble, and Galaxy seems to be missing something, we would be interested in learning about it.

Aidin · October 12, 2024, 9:50am

That’s really helpful! Thank you! @jennaj