[Beginner Question] Identifying Gene ID vs Protein ID

Hello all,

Sorry this is a very beginner question but I am new to bioinformatics and have used Galaxy to produce gene counts for Brassica napus data. My question has to do with nomenclature of “gene ID” - after running the analysis I am getting gene name and gene ID as “BnaC03…..” or “GSBRNA2T000343….” in the outputs, however I have been told that these are apparently protein names and I would like to find the gene ID that can be BLASTed and used to determine gene function.

Thanks!

Welcome @jjj

Yes, the nomenclature can be confusing! People often use some of these terms interchangeably as a kind of shorthand, and tools can label data in slightly confusing ways too!

One way to describe this:

  • The gene is the name of a footprint on the genomic strand where a distinct group of associated features is located. In annotation files, you’ll see this labeled as the gene_id.

  • Features may include transcripts. Each variant has a unique name.

  • Some transcripts are translated into proteins. Each also has a unique name.

  • Other features can be assigned to a gene, too! For example, a SNP.

Depending on what is being discussed, people might call any of these by the “gene” name, but in the data and annotation files the exact feature name will be used, and there are some standardized ways (mostly!) to organize and label the terms.
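
To make the gene / transcript / protein relationship concrete, here is a minimal sketch that groups feature identifiers under each gene_id. It assumes a GTF-style annotation where the ninth column holds `key "value";` pairs; your file may be GFF3 instead, which uses `key=value` pairs and would need a different parser. The identifiers (`geneA`, `geneA.t1`, `geneA.p1`) and the `protein_id` attribute name are made up for illustration and are not taken from the Brassica napus annotation.

```python
import re
from collections import defaultdict

# Matches GTF-style attribute pairs such as: gene_id "geneA";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def features_per_gene(gtf_lines):
    """Group transcript and protein identifiers under each gene_id."""
    genes = defaultdict(lambda: {"transcripts": set(), "proteins": set()})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        entry = genes[gene_id]
        if "transcript_id" in attrs:
            entry["transcripts"].add(attrs["transcript_id"])
        if "protein_id" in attrs:
            entry["proteins"].add(attrs["protein_id"])
    return genes


# Hypothetical records: one gene with two transcripts, one of them coding.
EXAMPLE_GTF = [
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "geneA";',
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.t1";',
    'chr1\tsrc\tCDS\t150\t800\t.\t+\t0\tgene_id "geneA"; transcript_id "geneA.t1"; protein_id "geneA.p1";',
    'chr1\tsrc\ttranscript\t100\t700\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.t2";',
]

gene_map = features_per_gene(EXAMPLE_GTF)
for gene_id, entry in gene_map.items():
    print(gene_id, sorted(entry["transcripts"]), sorted(entry["proteins"]))
```

In this toy example one gene_id ends up with two transcript names and one protein name, which is why a counts table keyed on the "gene" can still show identifiers that look like transcript or protein names.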

If you need the sequence for a gene, the “gene” itself is just the total footprint coordinate location. Instead, you’ll need to know which feature to base that “gene” data on for your analysis. Meaning, which feature do you want to use to represent your gene? BLAST can consume a nucleotide or protein sequence input.

Both can be used for homology searches when investigating higher level annotation such as function. For example, you could use one “representative” full transcript, full protein, both together, fragments of either, or all transcript variants from a gene or all proteins (sometimes these are compared within a gene, or between different versions of a genome assembly).

Searching one of your terms at NCBI, BnaC03 is a “gene” name. Your files probably have this labeled as the gene_id.

I didn’t find GSBRNA2T000343 at NCBI with a search, but I did find your species. The other term is likely in the annotation files here.

The first genome is annotated as the “reference”. This is the assembly view, and the FAQ here explains a bit about how to navigate the included files.

The FTP directory. The annotation files will list out all of the features “per gene” and each feature’s “identifier” will then map to specific features/sequence in the fasta files: .fna is nucleotide and .faa is protein (amino acids).
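
As a rough sketch of that identifier-to-sequence mapping, the snippet below pulls selected records out of fasta-formatted text so they can be used as BLAST queries. It assumes the identifier you want is the first whitespace-delimited token on each `>` header line, which is common but not guaranteed for every source. The record names and sequences are invented for illustration; with real data you would open the downloaded protein (.faa) file instead of the in-memory example.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a fasta file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the identifier is the first token after ">".
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_records(handle, wanted_ids):
    """Return {identifier: sequence} for the identifiers of interest."""
    return {name: seq for name, seq in read_fasta(handle) if name in wanted_ids}


# Hypothetical protein fasta content standing in for a real .faa file.
EXAMPLE_FAA = """>geneA.p1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>geneB.p1 another hypothetical protein
MSLLTEVETY
"""

selected = select_records(io.StringIO(EXAMPLE_FAA), {"geneA.p1"})
for name, seq in selected.items():
    print(f">{name}\n{seq}")
```

Inside Galaxy the same kind of filtering can be done with the existing text and fasta manipulation tools, so a script like this is only needed if you prefer to work outside of a history.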

You could load the data into Galaxy and query it, combine it with other lists of data, filter lists, and generate fasta files to use with BLAST.

This will all be based on the reference data. But you may also have novel/experimental data? From some sequencing experiment? Fragments or assemblies? Is the goal to compare that to the reference to learn what your data includes? If yes, the reference data above is a good source! You can use it as a custom genome/annotation with most tools, including BLAST.

Does this help? Would you like to share a bit of your output data? If you can mention the tool and reference data source used that can also help. If this is already in a Galaxy history, you could share it back here, let me know which dataset to look at, or capture some screenshots.

Let’s start there! :slight_smile: