Removing PCR duplicates in GStacks2, galaxy.eu

Hello everyone,

I’m sorry I’ve been away for too long.

Thank you for the reply. Yes, it’s the same tool I’m using. However, I do not have a reference genome. My data have to be run in “de novo mode”, and within its options I do not see PCR duplicate removal.

Do you have suggestions?

thanks

Silvia Bettenocurt

Hi @Silvia

This is a good followup question, so I’m glad you asked!

Duplicate removal is possible with gstacks when there is a reference because the reads are first mapped to it (which is why the input is an aligned BAM, not fastq). The mapping sets up a coordinate system in which reads with “exactly repeated” mapping characteristics can be filtered out as PCR duplicates.
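To make the idea concrete, here is a toy sketch in Python of coordinate-based duplicate filtering. It is not how gstacks implements it. The `AlignedPair` record, its fields, and the `drop_coordinate_duplicates` function are all hypothetical names made up for illustration, and the example data is invented.

```python
from collections import namedtuple

# Hypothetical, simplified record for an aligned read pair (not a real BAM record).
AlignedPair = namedtuple(
    "AlignedPair", ["name", "chrom", "start", "strand", "insert_len"]
)


def drop_coordinate_duplicates(pairs):
    """Keep only the first pair seen for each set of mapping coordinates."""
    seen = set()
    kept = []
    for pair in pairs:
        # Pairs with identical mapping characteristics are treated as duplicates.
        key = (pair.chrom, pair.start, pair.strand, pair.insert_len)
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept


# Invented example data: r2 repeats r1's coordinates exactly.
pairs = [
    AlignedPair("r1", "chr1", 100, "+", 250),
    AlignedPair("r2", "chr1", 100, "+", 250),
    AlignedPair("r3", "chr1", 100, "+", 310),
    AlignedPair("r4", "chr2", 500, "-", 250),
]

kept = drop_coordinate_duplicates(pairs)
kept_names = [pair.name for pair in kept]
print(kept_names)
```

The point is that the lookup key is built entirely from alignment coordinates. Unaligned reads have no such key, so this style of filtering has nothing to work with.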

With de novo, there isn’t a baseline coordinate mapping to use, and gstacks isn’t the right tool choice.

See the guide here

Then, the FAQs link leads to Stacks: Frequently Asked Questions

What are the input and output data formats for Stacks?

In the de novo case, data is read by the ustacks program and it currently can read either FASTA, FASTQ, or BAM formats. When a reference genome is available, aligned data is read by the gstacks program and either SAM or BAM formats can be input.

The tool is in Galaxy as one of these

  • Stacks2: ustacks Identify unique stacks

  • Stacks2: de novo map the Stacks pipeline without a reference genome

  • Stacks: ustacks align short reads into stacks

  • Stacks: de novo map the Stacks pipeline without a reference genome (denovo_map.pl)(Galaxy Version 1.46.0)

The protocol is in a publication (paywalled :upside_down_face:) but we don’t have a dedicated Galaxy tutorial I can point you to. If you can’t access the paper, try searching online to see if anyone has broken the steps out. The core steps will be about the same in Galaxy – the difference is usually just how to set the metadata such as datatypes, and here these are all fastq, BAM, and tabular datatypes, which are common across many tools.

If you are new to Galaxy, consider running through a Learning Pathway like this to get familiar with how to organize data and navigate around the interface. And, if you are already familiar with bioinformatics analysis, you can simply consider this a reference. → Learning Pathway: Introduction to Galaxy and Sequence analysis.

Hope this helps again! :slight_smile: