Running Diamond against NCBI's NR database

Hello @jennaj

This is a really helpful post, and I am trying to do something similar to the OP: I would like to run DIAMOND against the NCBI nr database. Based on this thread, I am hoping to import the pre-indexed NR database from the link you provided (ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*tar.gz) into my Galaxy instance.

Then, I hope to run DIAMOND on Galaxy with the following fields:

  1. “What do you want to align” --> Align DNA query sequences (blastx)
  2. “Input query file” --> I’ll input a contig .fastq file of metagenomic data
  3. “Will you select a reference genome from your history or use a built-in index?” --> I guess I would say “use one from my history” and then select the NCBI nr file that I imported
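For reference, outside Galaxy these three form fields correspond roughly to a single `diamond blastx` command line. A minimal Python sketch, assuming DIAMOND is installed locally: it only builds the argument list and prints it, nothing is executed. The file names `contigs.fastq`, `nr.dmnd` and `matches.tsv` are placeholders rather than files from this thread, and the helper name is made up for illustration.

```python
import shlex


def build_blastx_command(query, database, output, threads=4):
    """Return the argument list for a DIAMOND blastx run (nothing is executed)."""
    return [
        "diamond", "blastx",         # field 1: align DNA queries against proteins
        "--query", str(query),       # field 2: the query file
        "--db", str(database),       # field 3: the DIAMOND-formatted reference
        "--out", str(output),
        "--outfmt", "6",             # BLAST-style tabular output
        "--threads", str(threads),
    ]


# Placeholder file names, used only to show the resulting command line.
cmd = build_blastx_command("contigs.fastq", "nr.dmnd", "matches.tsv")
print(shlex.join(cmd))
```

The list could be passed to `subprocess.run(cmd, check=True)` on a machine that has DIAMOND and the database available.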

I am not sure how accurate my pipeline is, but tentatively I have two questions that I would be so grateful to hear advice about:

  1. It seems from this post that the public and usegalaxy platforms cannot accommodate this analysis. Of the platforms you listed (https://galaxyproject.org/choices/), would any of the free options support my tentative pipeline? I may not have access to “Academic Cloud Services” because I am a postdoctoral scholar at a university in Japan. It seems “TIaaS” may be an option that I can apply for (though I worry about whether I would be accepted), but would it even be suitable for this purpose? I would very much appreciate any advice on which platform to focus on.

  2. Does my pipeline even seem possible?

Thank you for any input.

Hi @SuzuBell

Diamond makedb and Diamond itself are good choices. I wouldn’t try using this particular fasta “from the history”, wherever you work. It is fairly large and will consume more resources per job each time it is run if you don’t use Diamond makedb.

Other usage options redundantly create a new index at the start of each job and may hit memory or runtime limits on public Galaxy servers, or on your own server if insufficient resources are allocated.
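To illustrate the "index once, reuse many times" point, here is a small Python sketch, assuming DIAMOND is used from the command line outside Galaxy. It builds the `diamond makedb` argument list and checks whether the `.dmnd` file that makedb writes already exists, so the index is only rebuilt when it is missing. `nr.fasta` and the `nr` prefix are placeholder names, and both helper functions are hypothetical.

```python
from pathlib import Path


def build_makedb_command(fasta, db_prefix):
    """Return the argument list that formats a protein fasta as a DIAMOND database."""
    return ["diamond", "makedb", "--in", str(fasta), "--db", str(db_prefix)]


def needs_indexing(db_prefix):
    """True if the <prefix>.dmnd file written by makedb does not exist yet."""
    return not Path(f"{db_prefix}.dmnd").exists()


# Placeholder names; nothing is executed here.
if needs_indexing("nr"):
    print("would run:", " ".join(build_makedb_command("nr.fasta", "nr")))
else:
    print("nr.dmnd already exists, reuse it for every alignment job")
```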

Both tools are hosted at Galaxy EU (https://usegalaxy.eu), or you can add them to your own server from the ToolShed (https://usegalaxy.org/toolshed/).

Maybe try the EU server first? NR is already a built-in index there and would be more likely to work, but it is an older version (Feb 2015). I am not sure if Diamond makedb will work or not with NR from the history (if you want to use the updated release). Still, both are certainly worth testing out.

For cluster options described at https://galaxyproject.org/use/:

Academic clouds

  • TIaaS is intended to aid those who are teaching/running workshops. You are correct that this is not appropriate for your work.
  • Jetstream is intended for US academic researchers
  • Others are described at the web link
  • Note: Not all of those listed under “Academic” clouds are actually free for everyone. Most are set up for particular regions/countries and funded by government grants. Others are fee-based for everyone but offer academic pricing. Some are domain-specific, and you wouldn’t be an admin (meaning they are pre-configured, so you won’t be able to install more tools, etc.).

Commercial clouds

  • Each is described at the link.
  • Cloudman at AWS is probably your best choice if the work is too large for Galaxy EU. Working at AWS is not entirely free, but they do offer scholarships for both teaching and research work through a web form. To clarify: the Cloudman version of Galaxy is free (software), but the AWS resources needed to run it are fee-based (hardware: storage/computational).

Hope that gives you some options. I broke this out into a new thread since a different mapping tool is being used. The prior post is linked to this one now as a reference.

Hello @jennaj

This is so very helpful. Thank you for the suggestion especially to try Galaxy EU. I would not have thought of that as I am not located in Europe. I will give it a try.

One small question I have is in regards to using NR on Galaxy EU. I plan to:

  1. Try running DIAMOND using the built-in NR index on Galaxy EU (older version, Feb 2015).

  2. Try running DIAMOND on Galaxy EU using a new version of NR. This is where my question lies. You mentioned: “I am not sure if Diamond makedb will work or not with NR from the history (if you want to use the updated release).” I will give it a try and report the results here. However, do you recommend that I place ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*tar.gz into my history on Galaxy EU and then try to run Diamond makedb on it? I ask because you also mentioned “I wouldn’t try using this particular fasta “from the history”, wherever you work.” Are you suggesting that I work with a different NR file, or that this file is unlikely to work anywhere, but that if I do try it, Galaxy EU is the place to do so?

Thank you so much again for your support.

@SuzuBell

Glad that helped.

For your second question, here are some more clarifications:

2a. The Upload tool in Galaxy can accept tar.gz archives, but only the first dataset in the archive will be loaded. This particular archive should only contain one dataset (the fasta for nr) but if you have trouble with that, uncompress the archive locally first, then Upload the fasta directly.
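If the archive does need to be unpacked locally, the step can be sketched in Python with the standard `tarfile` module. This is a sketch under assumptions: it takes for granted that the archive really holds a FASTA file with one of the listed extensions, which is worth verifying by listing the members first, since an archive of pre-formatted database volumes would contain no FASTA at all. The function names and the extension list are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path

# Assumed FASTA extensions; adjust to whatever the archive actually contains.
FASTA_SUFFIXES = (".fasta", ".fa", ".faa", ".fsa")


def list_members(archive_path):
    """Return the names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def extract_fasta(archive_path, dest_dir):
    """Extract the first FASTA-looking file; return its path, or None if absent."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.lower().endswith(FASTA_SUFFIXES):
                target = Path(dest_dir) / Path(member.name).name
                # Stream the member to disk rather than reading it into memory.
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    return None
```

Calling `list_members("some_archive.tar.gz")` before uploading shows whether there is a single fasta inside or several unrelated files; the extracted file is what would then go through the Upload tool.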

2b. Regarding the “from the history” option:

  1. some tools accept a fasta dataset directly on the tool form as a “Custom Genome” (one version of “from the history”)
  2. some require a “fasta” that has a different type of indexing: a “Custom Genome” promoted to a “Custom Build”
  3. some require a fasta that has been pre-processed by a special tool from the same tool suite (another version of “from the history”)
  4. some have built-in indexes
  5. …and some accept all or a combination of the above!

My mistake in the first reply. I just double-checked, and Diamond accepts input type “3” above (a fasta pre-processed with the “Diamond makedb” tool) or item “4” (built-in index).

Initially, I thought Diamond itself accepted item “1”, “3”, or “4”, but that was wrong: it expects item “3” or “4”, and not “1”. Glad you pointed this out!

Diamond makedb only accepts item “1”.
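The input rules just described can be written down as a small lookup table. This Python sketch only encodes what is stated in this reply (Diamond takes items 3 or 4, Diamond makedb takes item 1); the enum, the table and the `tools_to_run` helper are illustrative names, not part of Galaxy or DIAMOND.

```python
from enum import Enum


class ReferenceInput(Enum):
    CUSTOM_GENOME = 1   # item 1: plain fasta dataset from the history
    CUSTOM_BUILD = 2    # item 2: Custom Genome promoted to a Custom Build
    PREPROCESSED = 3    # item 3: fasta pre-processed by the suite's own tool
    BUILT_IN = 4        # item 4: index already installed on the server


# Which reference input types each tool accepts, as described in this thread.
ACCEPTS = {
    "Diamond makedb": {ReferenceInput.CUSTOM_GENOME},
    "Diamond": {ReferenceInput.PREPROCESSED, ReferenceInput.BUILT_IN},
}


def tools_to_run(available):
    """Return the tools to run, in order, for the reference input you have."""
    if available in ACCEPTS["Diamond"]:
        return ["Diamond"]
    if available in ACCEPTS["Diamond makedb"]:
        # makedb turns item 1 into item 3, which Diamond can then use.
        return ["Diamond makedb", "Diamond"]
    raise ValueError(f"no route from {available.name} to a Diamond run")


print(tools_to_run(ReferenceInput.CUSTOM_GENOME))
```

So an uploaded NR fasta (item 1) means two steps, while the built-in index (item 4) means one.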

Any method could fail if the job is too large to process (query size is a consideration, too, not just the target), but it is certainly worth trying at EU before investing in setting up your own Galaxy server.

Tools can have many usage options. Hopefully, this reply reduces the confusion I caused!

More about Custom Genomes/Builds

This FAQ may also help, though probably later on and with different tools. This particular fasta sourced from NCBI will already be formatted correctly to use with the Diamond makedb tool as a Custom Genome.

@jennaj

Thank you so much again for your reply! I would have never come up with this possibility and really appreciate the suggestion. I can check into whether this procedure works or perhaps is too computationally expensive (as you have indicated may occur). Thank you again!
