Taxonomy workflows: reads to taxa

Hi Jemma,

I have some samples being sequenced. These are for soil fungi eDNA metabarcoding.

On NCBI I have found and downloaded a FASTQ file (2MB) of the type I will be using.

Sequences will be about 300bp long.

I would like to use this to set up the following bioinformatics workflow so I am ready to go when I get my FASTQ files.

  1. read in Illumina MiSeq FASTQ files (the files will already be demultiplexed)
  2. trim sequences (if needed)
  3. filter them (e.g. reject low quality scores), if necessary
  4. cluster into OTUs using appropriate parameters
  5. match OTUs, using appropriate parameters, to a suitable UK fungi sequence database
  6. output the taxa list to a spreadsheet
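
For a feel of what step 3 does under the hood, here is a minimal Python sketch of mean-quality filtering. It assumes plain 4-line FASTQ records with Phred+33 quality encoding; the function names, thresholds and the two demo reads are all made up for illustration, and in Galaxy a dedicated QC tool would normally do this job.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20, min_len=100):
    """Keep reads that are long enough and have a high enough mean quality."""
    return [
        (header, seq, qual)
        for header, seq, qual in parse_fastq(handle)
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q
    ]


# Two invented reads: 'I' encodes Q40, '#' encodes Q2.
demo = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGT\n+\n########\n"
kept = filter_reads(io.StringIO(demo), min_mean_q=20, min_len=4)
print([header for header, _, _ in kept])
```

With the demo input only `@read1` passes the filter. Real thresholds (minimum length, minimum quality) depend on the amplicon and the run, so treat the defaults here as placeholders.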

I would like to generate QC charts at each stage.

Might be some more steps but that would be a good start.

This is a pretty standard workflow and there must be a workflow like this already in Galaxy for me to copy/adapt.

How would I find it ?

How do I start building and testing a workflow ?

Adam

Hello @Adam_Hillier

Glad to learn you are proceeding with your project! :scientist:

There are a few primary places to source workflows for Galaxy.

IWC – Production quality workflows

These are curated, so if you can find what you need here, that would be preferred! Each has been optimized for large batch stream processing. This catalog is newer and growing and has stricter community standards.

Galaxy Hub – Public Workflows

This is the “meta” search. You’ll find workflows from the GTN trainings, WorkflowHub, and the Public Workflows available from the communities at the UseGalaxy* servers.



A workflow from either can be customized further, too. It would be pretty common to break out an analysis like yours into two or three distinct module workflows. Then scientists could run them separately or nest them as subworkflows into a single master workflow that does everything with a bit more customization (reference data preparation, intermediate file offloading, workflow reports).

I didn’t find an IWC workflow for eDNA specifically and one of the training quality workflows from the GTN is probably too simple for your needs (no clustering). The other, using Obitools, will work best at one of the “Available at these Galaxies” servers for now – UseGalaxy.eu would be a good choice. :slight_smile:

Hope this helps to get things started! :slight_smile:

Hi,

I have got the Wolf Diet eDNA metadata set and workflow and have run that.

It almost worked. But it failed at one step and I am struggling to diagnose it. Pretty sure if I get that fixed it will then run to completion.

If I can get that to work I imagine I will be able to adapt the workflow for my own data.

Adam

Great! Glad to learn you have made progress! :slight_smile:

If you think the issue was technical, and would like some feedback, you are welcome to post back a share link to the history (or better, the workflow invocation – see the Share button in the top menu bar of that view). Maybe we can help to solve it here?

If you would rather share in a private thread, we can do that, but it will be harder to get feedback from our developers (if needed). You can decide.

Hi again,

I have managed to read in one of the WolfDiet fastq files and checked that it’s ok.

I have blasted the fastq file against the reference database.

So I have a list of sequences which I presume are all matched to a species name.

What I am struggling with now is how to add family, order, class and phylum to each matched sequence.

Once I can do that I am pretty much there.

Adam

Hi,

I think I am missing some tools such as TaxonKit and NCBI Taxonomy.

Any ideas why I can’t see them in tools ?

These will allow me to use a column with taxonomic IDs to produce other columns of taxonomic lineage (phylum, class, order, family, genus, species).
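
To make the idea concrete, here is a rough Python sketch of how a taxonomic ID can be expanded into those lineage columns by walking up a parent table. The small `TOY_TAXONOMY` dictionary is a hand-written stand-in for the NCBI taxonomy dump (it is an assumption for illustration, and the IDs should be checked against the real data), and `lineage` is a hypothetical helper, not a TaxonKit or Galaxy function.

```python
RANKS = ("phylum", "class", "order", "family", "genus", "species")

# Toy stand-in for the NCBI taxonomy: taxid -> (parent taxid, rank, name).
# Illustrative only; a real run would load this from the taxonomy dump files.
TOY_TAXONOMY = {
    1: (1, "no rank", "root"),
    7711: (1, "phylum", "Chordata"),
    40674: (7711, "class", "Mammalia"),
    33554: (40674, "order", "Carnivora"),
    9608: (33554, "family", "Canidae"),
    9611: (9608, "genus", "Canis"),
    9612: (9611, "species", "Canis lupus"),
}


def lineage(taxid, taxonomy):
    """Return [phylum, class, order, family, genus, species] for a taxid.

    Ranks that are not found on the way to the root are left as "".
    """
    found = {}
    seen = set()
    # Walk towards the root; 'seen' stops the loop at the self-parented root.
    while taxid in taxonomy and taxid not in seen:
        seen.add(taxid)
        parent, rank, name = taxonomy[taxid]
        if rank in RANKS:
            found[rank] = name
        taxid = parent
    return [found.get(rank, "") for rank in RANKS]


print(lineage(9612, TOY_TAXONOMY))
```

An unknown ID simply gives six empty columns, which makes unmatched rows easy to spot in a spreadsheet later.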

Adam

Hi team,

Galaxy needs a published metabarcode database creation and BLAST workflow that works. I am happy to do that.

The Wolf Diet workflow writeup talks about ecoPCR and ecoTag, which I can't find in Galaxy tools.

I have also read about NCBI Taxonomy and TaxonKit, which should be on Galaxy, but I can't find these either. Do I have access to only some tools?

Without these I can only BLAST my dereplicated metabarcode FASTA sequences against the full NCBI nucleotide database (see below), which will take much, much longer.

Adam

Hi all,

Update on my Wolf Diet workflow trial.

I got the BLAST to run reasonably quickly using the full NCBI NT (15 Aug 2024) database. Not ideal perhaps but it worked.

I then merged in the taxonomy lineage by downloading a file with TAXIDs (output from BLAST), running TaxonKit in my terminal, and then uploading the file back into Galaxy. ChatGPT helped me do this.

This allowed me to produce a Krona chart.

I then copied and pasted the species list into ChatGPT and asked it to tell me where the samples were from (see below).
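
For anyone repeating this, the merge itself can be sketched in a few lines of Python. This is only an illustration under assumptions: the BLAST table is taken to be tab-separated with the taxonomic ID in the second column, the lineage table is taken to have the ID first followed by the rank columns, and `merge_lineage` plus the demo rows are hypothetical. Your actual column layout will depend on how BLAST and TaxonKit were run.

```python
import csv
import io


def merge_lineage(blast_handle, lineage_handle, taxid_col=1):
    """Append lineage columns to each BLAST row by matching on the taxid."""
    lineages = {}
    for row in csv.reader(lineage_handle, delimiter="\t"):
        if row:
            lineages[row[0]] = row[1:]
    # Pad rows without a match so every output row has the same width.
    width = max((len(cols) for cols in lineages.values()), default=0)
    merged = []
    for row in csv.reader(blast_handle, delimiter="\t"):
        if row:
            merged.append(row + lineages.get(row[taxid_col], [""] * width))
    return merged


# Invented demo data: query id, taxid, percent identity.
blast_tsv = "otu1\t9612\t99.5\notu2\t424242\t91.0\n"
lineage_tsv = "9612\tChordata\tMammalia\tCarnivora\tCanidae\tCanis\tCanis lupus\n"

merged = merge_lineage(io.StringIO(blast_tsv), io.StringIO(lineage_tsv))

# Write the result back out as a tab-separated table for upload.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(merged)
print(out.getvalue())
```

Rows whose ID has no lineage entry (the second demo row) keep their BLAST columns and get empty lineage cells, so nothing is silently dropped.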

Adam

:one: China — strongest match

China fits both wolves and many realistic prey species in your data.

Wolf prey from your list found in China

  • Capreolus pygargus (Siberian roe deer)

  • Cervus elaphus (red deer)

  • Cervus nippon (sika deer)

  • Cervus albirostris (white-lipped deer)

  • Elaphodus cephalophus (tufted deer)

  • Procapra gutturosa / picticaudata / przewalskii (gazelles)

  • Pantholops hodgsonii (Tibetan antelope)

  • Marmota sibirica / himalayana (marmots)

Why this works

  • Wolves are native to northern & western China

  • These are documented wolf prey in steppe, plateau, and forest systems

  • No need to invoke introductions or zoos

Hi all,

I now feel ready to use Galaxy to process and analyse my own data.

It would be good if TaxonKit was in Galaxy. Is that possible? Then the whole workflow could be done in Galaxy.

Is there a published paper about the wolf diets so I can read it and compare with my findings?

Thanks

Adam