blastp against nr database

Tags: database, blast, blastp

#1

Dear Galaxy users,
Is the only way to run blastp in Galaxy against a database from your own history (meaning: download a FASTA file from https://www.uniprot.org/downloads, upload it to your history, make a protein BLAST database, and then blast), or is there access to these databases within the tool itself?
I want to blastp against the NR database or TrEMBL, but when I tried to make a protein BLAST database from TrEMBL, I got an error: “Duplicate seq_ids are found”.
Any idea what I should do?
Thanks a lot!


#2

Yes, this is true at public Galaxy servers.

If running your own Galaxy, native indexes can be installed. NCBI hosts pre-built indexes or you can create these yourself. Should you be interested in doing that, review the tool installation help in the ToolShed here: https://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/e25d3acf6e68

I wouldn’t expect duplicated identifiers in fasta data obtained from this source, so the error was probably triggered by something else. I would suggest removing the description content and testing to see if that fixes the indexing problem.
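If you want to try stripping the description content before indexing, a one-liner like this should do it (a sketch only; the filenames are placeholders, and the toy input stands in for the real UniProt download):

```shell
# Toy input standing in for the real FASTA (placeholder record).
printf '>sp|P12345|NAME_HUMAN Some description text\nMKVL\n' > uniprot.fasta

# Keep only the first whitespace-delimited token of each header line,
# dropping the description text after the ID.
sed '/^>/ s/[[:space:]].*$//' uniprot.fasta > uniprot_ids_only.fasta
```

The sequence lines are untouched; only header lines (starting with “>”) are trimmed.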

FAQ: https://galaxyproject.org/learn/datatypes/#fasta

Update: I double-checked, and we did have a problem reported earlier about some of the UniProt databases failing makeblastdb due to duplicated sequences. The help above will not correct the problem if it is present in the source data, nor will changing any of the options on the indexing tool form.
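If the duplicates really are in the source FASTA, one possible workaround (a sketch, not something the indexing tool does for you; the toy file stands in for the real data) is to keep only the first record for each ID before uploading:

```shell
# Toy FASTA with a deliberately duplicated ID, standing in for the real file.
printf '>P1 first\nMK\n>P2\nAC\n>P1 second\nGG\n' > trembl.fasta

# Keep only the first record per ID: each header line sets a keep/skip flag,
# and the sequence lines that follow inherit that decision.
awk '/^>/ { keep = !seen[$1]++ } keep' trembl.fasta > trembl_dedup.fasta
```

Note this keys on the first token of the header, so run it after (or instead of) any description stripping, and be aware it silently discards the later copies.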

More info: BLAST+ processes “sequence_ids” a bit differently from other tools. The Galaxy makeblastdb tool can be used in a way that avoids the ID parsing requirement when building an index by setting the option “Parse the sequence identifiers” to “No”. But parsing or not parsing out IDs would not resolve a duplicate problem.
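You can also check the FASTA yourself for duplicated identifiers before running makeblastdb. A sketch (the toy file stands in for the real download) that prints any ID occurring more than once:

```shell
# Toy FASTA with a deliberately repeated ID, standing in for the real data.
printf '>P1 first copy\nMK\n>P2\nAC\n>P1 second copy\nGG\n' > trembl.fasta

# Print any ID (first token of a header line) that occurs more than once.
grep '^>' trembl.fasta | awk '{print $1}' | sort | uniq -d
# prints: >P1
```

If this prints nothing, the duplicate error was likely triggered by something else.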

I didn’t find any duplicates in the “Reviewed (Swiss-Prot)” database and makeblastdb ran fine on it within Galaxy (parsing seqids or not). I am still running the test on “Unreviewed (TrEMBL)” to see if I can replicate your result (out of curiosity, not to “fix” the data, only the data source can decide whether or not to include duplicates).

How to resolve the problem: If you want a non-redundant protein database target, TrEMBL isn’t the best choice anyway, as it is not curated and is definitely redundant in terms of content. I would suggest trying SwissProt or NR instead. Or try both, compare the results, and decide which to use. There is overlap between the two: SwissProt is high quality and manually curated, but may lack newer data; NR contains most of SwissProt, plus more from other sources, and is not fully manually curated.


#3

Thanks for your reply and for checking these databases.
Where can I download the NR database?


#4

The fasta and pre-indexed versions of NR can be found here: ftp://ftp.ncbi.nlm.nih.gov/blast/db/

This is a large database that will probably fail indexing (using makeblastdb) for exceeding memory or runtime resources at public servers (including all usegalaxy.* servers). It is best used in your own Galaxy.


#5

Thanks for the link, but which file there should I download?
Also, I’m not sure what you mean by using my own Galaxy.
Maybe I can ask the administrator to allocate more memory to this specific process?


#6

This depends on if you want to create your own indexes or use the pre-built indexes. The fasta version of the data and the pre-built indexes are both available. Review the README at this location to learn about both.
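As a rough sketch of the two routes (to be run on your own server, not at a public Galaxy; check the exact filenames and commands against the README at that FTP location):

```shell
# Option 1: fetch the pre-built NR indexes with the helper script that
# ships with BLAST+ (this downloads many large volume files).
update_blastdb.pl --decompress nr

# Option 2: fetch the raw FASTA and build the index yourself.
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz
makeblastdb -in nr -dbtype prot -title nr -out nr
```

Either way, expect the downloads and indexing to require substantial disk, memory, and time.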

There are many ways to run your own Galaxy. CloudMan is a popular choice for scientists. Please see:

For very large data, using a public server is often not an option. SwissProt is smaller and will index at Galaxy Main https://usegalaxy.org. NR is very large and will fail downloading and/or indexing due to exceeding resources.

Note: How successful a BLAST job against an indexed SwissProt database will be depends on many factors, including the query content/size and the parameters set. If you decide to try it, you’ll need to test your exact data to see if it works. Using stricter parameters would be a very good idea (higher percent identity, higher coverage, limiting the number of returned results). Using smaller queries might also be necessary. It is very easy to produce a great deal of output with BLAST, especially when using default parameters. The BLAST parameters in the Galaxy BLAST wrappers are the same as those used on the command line, so the standard BLAST tool manual is a good resource, along with existing online forum discussions about the tool.
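For example (a sketch; the query/database names and thresholds are placeholders), the command-line equivalents of stricter settings look like the comment below, and the tabular output (-outfmt 6) is easy to trim further afterwards:

```shell
# Stricter blastp settings map to the command line like this (not run here;
# names and thresholds are placeholders):
#   blastp -query queries.fasta -db swissprot -evalue 1e-10 \
#          -qcov_hsp_perc 80 -max_target_seqs 5 -outfmt 6 -out hits.tsv
# -outfmt 6 is tab-separated with percent identity in column 3, so results
# can be filtered further. A toy stand-in for real output:
printf 'q1\ts1\t95.0\nq1\ts2\t60.0\n' > hits.tsv
awk -F'\t' '$3 >= 90' hits.tsv > strong_hits.tsv   # keep hits >= 90% identity
```

The same thresholds can be set directly on the Galaxy tool form instead of post-filtering.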

An administrator can sometimes allocate more quota space but this will not solve the problem when working with certain types of large data like this one. Downloading, indexing, and executing BLAST with data as large as NR requires computational resources that are beyond the scope of what a public server can provide.

Hopefully, that gives you some more answers and options!