Error in StringTie Results

Hello,
I downloaded a GTF file from NCBI and uploaded it to my Galaxy. I ran HISAT2 on my sample, but I got an error when I used the GTF file with StringTie:
Error: no valid ID found for GFF record
I even removed the header comments and ran StringTie against my HISAT2 result again, but I still got the error.
Does anyone know what the problem is?

GTF File after removing the header lines

CP083173.1 Genbank gene 1 1155 . + . gene_id K9E40_00005; transcript_id ; gbkey Gene; gene_biotype protein_coding; locus_tag K9E40_00005;
CP083173.1 Protein Homology CDS 1 1152 . + 0 gene_id K9E40_00005; transcript_id unassigned_transcript_1; gbkey CDS; inference COORDINATES: similar to AA sequence:RefSeq:WP_004115883.1; locus_tag K9E40_00005; product sugar-binding protein; protein_id UQA84770.1; transl_table 11; exon_number 1;

Corresponding FNA file

CP083173.1 Gardnerella vaginalis strain JNFY11 chromosome, complete genome
ATGAACATAGGTAAAAAAGCAATCGCTTTATTTGTAGGTATTGCTGTAGTTGCTGGTTTATCAGCCTGTTCAGGTTCAAG
AGGTGGTGCATCCAAAAACGTAAGTCAAGGAATTGAAAAAGGTGCCACCATCGGTGTCTCTATGCCAACAAAGTCAGAGG…

Hi @Faaiza_Ibrahim_B.Sc
Could you briefly describe your goal(s)? Are you after gene annotation or read counting/gene expression?
You are using HISAT2 and StringTie on bacterial data. HISAT2 is a gapped aligner that can align reads across introns, and StringTie is often used to predict genes in eukaryotic genomes. Many bacterial genes, on the other hand, are organized in operons. Read splitting (spliced alignment) can be disabled in HISAT2. I am not sure StringTie is the best option for annotating bacterial genes, but I am not familiar with the topic.
For read counting, consider alternative tools such as featureCounts or htseq-count. You may need to tweak the settings, depending on the annotation file and your goals. Read counting tools count reads on the features named in the 3rd column (type) and aggregate the results using one of the attributes from the last column. By default, many read counting tools count reads against exons, but exon annotations may be absent from some bacterial datasets (no 'exon' in the 3rd column). In that case you need to choose another feature type, for example CDS or gene. Different settings may give different results, and the same applies to the choice of attribute.
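To see which feature types and attributes an annotation file actually offers, a quick tally can help before choosing settings. The following is only a minimal Python sketch, assuming a tab-delimited GTF with the attributes in the 9th column. The function names `parse_attributes` and `summarize_gtf` are made up for illustration and are not part of any tool mentioned here. It also flags records whose `gene_id` or `transcript_id` is empty, as in the `gene` line of the posted file. That may be worth checking in connection with the "no valid ID" error, although this thread does not confirm it as the cause.

```python
from collections import Counter


def parse_attributes(field):
    """Parse a GTF attribute column into a dict.

    Values may be quoted, unquoted, or empty. Assumption: no ';'
    occurs inside a quoted value (a naive split would break on it).
    """
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def summarize_gtf(lines):
    """Count feature types (3rd column) and list records with empty IDs.

    Returns (feature_counts, empty_ids), where empty_ids holds
    (line_number, feature_type, attribute_name) tuples.
    """
    feature_counts = Counter()
    empty_ids = []
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature_counts[cols[2]] += 1
        attrs = parse_attributes(cols[8])
        for key in ("gene_id", "transcript_id"):
            if key in attrs and attrs[key] == "":
                empty_ids.append((number, cols[2], key))
    return feature_counts, empty_ids
```

Calling `summarize_gtf(open("annotation.gtf"))` (the file name is a placeholder) returns the feature counts, which show whether `exon`, `CDS`, or `gene` is available for counting, and the list of records with an empty ID.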
As I don't know what you are trying to achieve and cannot check the data, it is hard to answer your question.

Kind regards,

Igor

Thank you for the quick response!
Problem solved. My data is RNA-seq and I hope to determine gene expression in my samples.
I have run the analysis through DESeq2 and found some novel genes (MSTRG), as well as some genes annotated as hypothetical proteins, and I am not sure how to annotate them.
I plan to retrieve the FASTA sequences for the novel genes and perform a two-step BLAST:

  1. BLAST against the specific bacterium (in this case, Gardnerella vaginalis)
  2. BLAST against an organism related to Gardnerella
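
Retrieving the sequences for the novel genes could be sketched roughly as below. This is a minimal Python illustration, not a definitive method. It assumes each novel gene is a single contiguous region (no splicing) and that its coordinates are 1-based and inclusive, as in GTF. The helper names `read_fasta` and `extract_region` are hypothetical. Dedicated tools such as gffread or bedtools getfasta are likely to be more robust for real data.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence} from FASTA lines.

    The sequence id is taken as the first word after '>'.
    """
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(sequences, seq_id, start, end, strand="+"):
    """Extract the 1-based inclusive region [start, end] from seq_id.

    For features on the '-' strand the reverse complement is returned.
    """
    region = sequences[seq_id][start - 1:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

For example, with the genome loaded via `read_fasta(open("genome.fna"))` (placeholder file name), `extract_region(seqs, "CP083173.1", start, end, strand)` would give the sequence to paste into BLAST, using the start, end, and strand from the MSTRG record in the StringTie GTF.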

For the hypothetical proteins, I have not decided on a procedure yet.

Thank you in advance.