I’m looking to use the NCBI BLAST+ blastp tool within Galaxy (link) to analyze microbial peptides. I have previously used the NCBI website (NCBI Protein BLAST) and wanted to see whether I could reproduce my results in Galaxy.
When using the NCBI BLASTP website, I selected the non-redundant protein sequences (nr) database:
Title: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
Molecule Type: Protein
Update date: 2025/03/13
Number of sequences: 881,786,946
In the Galaxy tool, the most recent (locally installed) NCBI nr database is dated September 2023. Can I request that an updated version be uploaded/installed?
Updated indexes take time to process, so this will not be immediate.
The web BLAST tools at NCBI always host the most current daily version, so that tool versus Galaxy will never be exactly comparable when using the full indexes. That said, if this is a technical exercise, you might be able to use smaller queries and carefully constructed settings and result filtering to get output that is small enough to review, then confirm the additions/removals between index releases.
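For example, if both runs are saved as tabular (outfmt 6) output, one quick way to confirm the additions/removals is to diff the subject IDs between the two result files. A minimal sketch in Python (the filenames are placeholders for your own result files):

```python
# Minimal sketch: compare subject IDs between two tabular (outfmt 6) BLASTP results,
# e.g. a web-BLAST run vs. a Galaxy run against an older index.
# "web_hits.tsv" and "galaxy_hits.tsv" are placeholder filenames.

def subject_ids(path):
    """Collect the subject sequence IDs (column 2 of outfmt 6) from one result file."""
    with open(path) as handle:
        return {line.split("\t")[1] for line in handle if line.strip()}

web = subject_ids("web_hits.tsv")
galaxy = subject_ids("galaxy_hits.tsv")

print("only in the newer (web) index:", sorted(web - galaxy))
print("only in the older (Galaxy) index:", sorted(galaxy - web))
print("shared hits:", len(web & galaxy))
```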
Thank you for your response and for opening a ticket! I do want to note that the ticket says “Sep 2024”, but on Galaxy the year is actually 2023.
I also would like to ask for advice on how BLASTP should be used in my analysis. For context, I am using microbial peptide sequences, and these queries are shorter than 30 residues. My objective is to annotate these peptides with taxonomy and functional information. Calling back to another post of mine: I am using Unipept and BLASTP in tandem as I analyze microbial peptides, using the two tools together to boost confidence in peptide identification and characterization.
I have thought of two approaches for the BLASTP database:
1. Use the complete NR database for broad coverage. Admittedly, this seems like overkill (since I’m only interested in the microbial component) and will result in lengthy runtimes.
2. Build a smaller, more manageable NR database for faster searching (is this feasible? see the sketch after this list). The tradeoff would be less coverage and a database that is potentially too restrictive, useful only in specific cases or for certain types of data, i.e., not reusable across applications.
I don’t want to be either too broad or too limiting in my strategy. Any suggestions are welcome!
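For option 2, my rough understanding is that building my own subset would mean pre-filtering a protein FASTA down to the microbial sequences and then indexing it with makeblastdb, along these lines (file names are placeholders; this is only how I imagine it, not a tested recipe):

```python
# Rough sketch of option 2: index a pre-filtered, microbial-only protein FASTA
# with makeblastdb. "microbial_subset.fasta" is a placeholder; actually producing
# that FASTA (downloading and filtering nr) is the part I expect to be tedious.
import subprocess

subprocess.run(
    [
        "makeblastdb",
        "-in", "microbial_subset.fasta",   # pre-filtered protein FASTA (placeholder)
        "-dbtype", "prot",
        "-parse_seqids",                   # keep the original accessions searchable
        "-title", "nr_microbial_subset",
        "-out", "nr_microbial_subset",
    ],
    check=True,
)
```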
Thanks @katherine-d21 for the date catch! I’ve updated the ticket.
I think the subset has some benefits, but you could test this as you go by comparing a full run with your subset run for a few samples. Once you are happy with the subset, future runs could use just your custom subset with a bit more confidence.
That will also pre-filter out any wildly incorrect cross-species matches at the start, which would let you better limit the hits reported (instead of filtering afterwards).
To create the subset, instead of building your own indexes (this will get tedious), what about using the Advanced Options → Restrict search of database to a given set of ID’s option with the full database? You can use taxIDs here.
That would allow you to create subsets on demand by changing which core index elements to include/exclude. It might take a few runs to find the optimal subset, and you could adjust it based on the sample or goals. The process would be quicker for you and would only require changing the filter list, instead of also processing all the steps to retrieve, parse, and index the associated reference sequences yourself.
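For reference, what that Galaxy option does should correspond to blastp’s taxonomy-restriction flags on the command line. A minimal sketch, assuming a v5-format database and placeholder file names/taxIDs:

```python
# Sketch of restricting a full-nr blastp search to a list of taxIDs
# (this should be the command-line analogue of the Galaxy "Restrict search
# of database to a given set of ID's" option). File names are placeholders;
# requires a v5-format database with taxonomy information.
import subprocess

# One NCBI taxID per line, e.g. 562 = Escherichia coli, 1280 = Staphylococcus aureus.
# Higher-rank taxIDs may need expanding to species-level IDs first, depending on
# the BLAST+ version.
with open("microbial_taxids.txt", "w") as handle:
    handle.write("562\n1280\n")

subprocess.run(
    [
        "blastp",
        "-query", "peptides.fasta",              # placeholder query file
        "-db", "nr",
        "-taxidlist", "microbial_taxids.txt",    # keep only records with these taxIDs
        "-outfmt", "6",
        "-out", "peptides_vs_nr_subset.tsv",
    ],
    check=True,
)
```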
I really like your idea of an initial full run and then creating/running subsets for comparison. I do have more follow-up questions to make sure I understand what you shared:
For creating the subset, can you define/clarify what you mean by “core index elements” and “filter list”?
My interpretation is that the GI ID/SeqID/taxID is the index element that would be included in or excluded from the subset used for the database search, and that the filter list is the dropdown menu (Restrict search of database to a given set of ID’s > the user selects positive/negative identifiers).
Just so I understand what’s being included/excluded, for any ID type:
Selecting positive identifiers would include those elements in the subset?
Selecting negative identifiers would exclude them?
Thank you again in advance! I know I have a lot of questions. I’ve helped with testing Galaxy tools and workflows in the past, but my current project has me taking a much more involved role on the development side of building a workflow, in addition to the testing I was doing before. I definitely have more respect for the effort that goes on behind the scenes.
By the core index, I just mean the sequences (elements) that are in the original index. Each of those sequences is associated with metadata. You can filter by the sequence identifiers, but also by that other metadata; the taxID is one of those metadata fields.
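If it helps to see that metadata directly, blastdbcmd can print it per sequence. A small sketch (the accession is only an example):

```python
# Small sketch: print the metadata stored alongside one sequence in the index.
# %a = accession, %T = taxID, %t = title. The accession below is only an example.
import subprocess

subprocess.run(
    ["blastdbcmd", "-db", "nr", "-entry", "P01308", "-outfmt", "%a %T %t"],
    check=True,
)
```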
For include/exclude: yes, this works exactly how you describe it. There is a link down in the Help section to the BLAST docs; this works in Galaxy the same as anywhere else. The tools are the same, and the Galaxy parts are just wrappers around the original command-line tools.
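In command-line terms, the positive/negative choice should correspond to paired include/exclude flags. A minimal sketch with placeholder query and taxID values:

```python
# Positive vs. negative identifier restriction in command-line terms (the Galaxy
# dropdown should map to paired flags like these). TaxIDs and files are placeholders.
import subprocess

base = ["blastp", "-query", "peptides.fasta", "-db", "nr", "-outfmt", "6"]

# Positive identifiers: only report hits whose taxID is in the list.
subprocess.run(base + ["-taxids", "562,1280", "-out", "included.tsv"], check=True)

# Negative identifiers: report hits from everything EXCEPT the listed taxIDs.
subprocess.run(base + ["-negative_taxids", "562,1280", "-out", "excluded.tsv"], check=True)
```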
Great that you helped with testing before! If you do run into anything odd while building up your workflow, feel free to bring it here to vet. Sometimes corner-case situations only show up later on.