Understanding BLASTX Tabular Output Against Swiss-Prot (25 Columns)

Hi everyone,

I’ve used the NCBI BLAST+ blastx tool to compare an assembled bacterial genome against the Swiss-Prot database. The output is in tabular format with a total of 25 columns.

I have a few questions:

  1. Could someone help explain the meaning or description of columns 13 to 25?
  2. I noticed there’s no column directly describing gene function. Why is that the case?
  3. Is it possible to retrieve the predicted protein sequences from any of the output files generated during the run?

Thank you in advance for your help!

Hi @wee

Good questions! Let’s go through each.

The column descriptions are documented at NCBI, and you can also find them in the Help section at the bottom of any of the BLAST+ tool forms. I’ve quoted that here:

Output format

Because Galaxy focuses on processing tabular data, the default output of this tool is tabular. The standard BLAST+ tabular output contains 12 columns:

| Column | NCBI name | Description |
|--------|-----------|-------------|
| 1 | qaccver | Query accession dot version |
| 2 | saccver | Subject accession dot version (database hit) |
| 3 | pident | Percentage of identical matches |
| 4 | length | Alignment length |
| 5 | mismatch | Number of mismatches |
| 6 | gapopen | Number of gap openings |
| 7 | qstart | Start of alignment in query |
| 8 | qend | End of alignment in query |
| 9 | sstart | Start of alignment in subject (database hit) |
| 10 | send | End of alignment in subject (database hit) |
| 11 | evalue | Expectation value (E-value) |
| 12 | bitscore | Bit score |

Until BLAST+ 2.5.0, the first two columns were qseqid and sseqid, which were usually strings containing multiple pipe-separated entries. In BLAST+ 2.5.0, the first two columns became qacc and sacc (accession only), while in BLAST+ 2.6.0 this was changed again to use qaccver and saccver (accession dot version).

The BLAST+ tools can optionally output additional columns of information, but this takes longer to calculate. Many commonly used extra columns are included by selecting the extended tabular output. The extra columns are included after the standard 12 columns. This is so that you can write workflow filtering steps that accept either the 12 or 25 column tabular BLAST output. Galaxy now uses this extended 25 column output by default.

| Column | NCBI name | Description |
|--------|-----------|-------------|
| 13 | sallseqid | All subject Seq-id(s), separated by a ‘;’ |
| 14 | score | Raw score |
| 15 | nident | Number of identical matches |
| 16 | positive | Number of positive-scoring matches |
| 17 | gaps | Total number of gaps |
| 18 | ppos | Percentage of positive-scoring matches |
| 19 | qframe | Query frame |
| 20 | sframe | Subject frame |
| 21 | qseq | Aligned part of query sequence |
| 22 | sseq | Aligned part of subject sequence |
| 23 | qlen | Query sequence length |
| 24 | slen | Subject sequence length |
| 25 | salltitles | All subject title(s), separated by a ‘<>’ |
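
If you ever want to work with this file outside Galaxy, the column layout above can be applied with a few lines of Python. This is only a minimal sketch: the function name `read_blast_tabular` and the sample row are made up for illustration, and it assumes a tab-separated file with either 12 or 25 columns in the order listed above.

```python
import csv
import io

# Column names as listed in the two tables above.
STD_COLUMNS = ["qaccver", "saccver", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
EXT_COLUMNS = ["sallseqid", "score", "nident", "positive", "gaps", "ppos",
               "qframe", "sframe", "qseq", "sseq", "qlen", "slen", "salltitles"]


def read_blast_tabular(handle):
    """Yield one dict per hit from 12- or 25-column BLAST+ tabular output."""
    # QUOTE_NONE: subject titles may contain quote characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) == 12:
            names = STD_COLUMNS
        elif len(row) == 25:
            names = STD_COLUMNS + EXT_COLUMNS
        else:
            raise ValueError(f"expected 12 or 25 columns, got {len(row)}")
        hit = dict(zip(names, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if "salltitles" in hit:
            # Column 25 holds one or more titles separated by '<>'.
            hit["titles"] = hit["salltitles"].split("<>")
        yield hit


# Made-up example row (hypothetical IDs and values, not real results).
sample = "\t".join([
    "contig_1", "HYPO1.1", "98.5", "200", "3", "0", "1", "600", "1", "200",
    "1e-120", "350", "sp|HYPO1|HYPO_EXAMPLE", "900", "197", "199", "0",
    "99.5", "1", "0", "MAID", "MAID", "5000", "353",
    "Example protein A<>Example protein B",
]) + "\n"

for hit in read_blast_tabular(io.StringIO(sample)):
    if hit["evalue"] <= 1e-10:
        print(hit["qaccver"], hit["saccver"], hit["titles"])
```

The titles in column 25 are the closest thing to a function description in the BLAST output itself, which is why the sketch splits them out.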

The third option is to customise the tabular output by selecting which columns you want, from the standard set of 12, the default set of 25, or any of the additional columns BLAST+ offers (including species name).

The functional annotation for a gene gets updated over time, while the sequence itself usually does not. That means the protein sequences you are searching against change less frequently, and BLAST only cares about that part. The meaning (function) is another layer that is added in later.

You can try a tool like annotateMyIDs when the target is one of the supported types and the mapping is against a single species. For Swiss-Prot, this won’t work, and you would need to pull in annotation files from the database source, then merge based on a common identifier column (the hit sequence ID). The tool NCBI Datasets Gene might work with Swiss-Prot IDs, but I’m not sure, so it is worth a try.
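
The merge step described above can be sketched in Python like this. Everything here is an assumption for illustration: the function names are made up, the annotation table is assumed to be a two-column tab-separated file (accession, description) that you prepared from the database source, and column 2 of the BLAST output is assumed to hold a plain accession with an optional ".version" suffix.

```python
import csv
import io


def strip_version(accession):
    """Turn 'ACC.2' into 'ACC'; leave IDs without a numeric version alone."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


def load_annotation(handle):
    """Read a two-column TSV (accession, description) into a dict."""
    table = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2:
            table[row[0]] = row[1]
    return table


def annotate_hits(hit_handle, annotation, missing="NA"):
    """Yield each BLAST row with the matching description appended."""
    for row in csv.reader(hit_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key = strip_version(row[1])  # column 2 = subject accession
        yield row + [annotation.get(key, missing)]


# Made-up data (hypothetical accessions and descriptions).
annotation = load_annotation(io.StringIO("HYPO1\tExample recombinase\n"))
hits = io.StringIO("contig_1\tHYPO1.1\t98.5\ncontig_2\tHYPO9.3\t45.0\n")
for row in annotate_hits(hits, annotation):
    print("\t".join(row))
```

Inside Galaxy, the same idea is a join of the two tables on the identifier column, so a script like this is only needed if you are working with the files locally.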

The predicted sequence itself is not available from the tabular output, but you can get it from the XML output using Parse blast XML output. You could also explore something like BlastXML to gapped GFF3 for more details.
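
For a rough idea of what is in the XML, here is a minimal Python sketch that reads the standard BLAST XML (outfmt 5) element names such as `Hsp_qseq` and `Hsp_query-frame`. The function name is made up, and note the limitation: for blastx this returns the translated part of the query that aligned to each hit, with gap characters removed, not a complete predicted protein.

```python
import io
import xml.etree.ElementTree as ET


def aligned_query_proteins(xml_source):
    """Yield (query, hit accession, frame, protein fragment) per HSP."""
    root = ET.parse(xml_source).getroot()
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="unknown")
        for hit in iteration.iter("Hit"):
            accession = hit.findtext("Hit_accession", default="unknown")
            for hsp in hit.iter("Hsp"):
                frame = hsp.findtext("Hsp_query-frame", default="0")
                qseq = hsp.findtext("Hsp_qseq", default="")
                # Remove alignment gaps to get the plain amino acid string.
                yield query, accession, frame, qseq.replace("-", "")


# Tiny made-up XML fragment (hypothetical IDs), trimmed to the fields used.
example_xml = """<BlastOutput><BlastOutput_iterations><Iteration>
<Iteration_query-def>contig_1</Iteration_query-def>
<Iteration_hits><Hit><Hit_accession>HYPO1</Hit_accession>
<Hit_hsps><Hsp><Hsp_query-frame>2</Hsp_query-frame>
<Hsp_qseq>MA-ID</Hsp_qseq></Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration></BlastOutput_iterations></BlastOutput>"""

for query, accession, frame, protein in aligned_query_proteins(io.StringIO(example_xml)):
    # Write each fragment as a FASTA record.
    print(f">{query}|{accession}|frame={frame}\n{protein}")
```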


Other options for bacterial assembly annotation include:

  1. Bakta.

  2. Methods like the one in this tutorial → Hands-on: Bacterial Genome Annotation

  3. Publications. Replicating methods is usually possible.

Hope this helps!

Thanks so much

Ah, yes, the sequence descriptions are a good place to start. You can layer in more later if you want to.