Stringtie merge error

I’m following along with the de novo transcriptome assembly Galaxy tutorial (De novo transcriptome reconstruction with RNA-Seq) using my own data. I’m stuck at the stringtie-merge step, where I get the following error message:

Error: could not any valid reference transcripts in /corral4/main/objects/6/5/e/dataset_65e2a2c5-762d-4f88-bec9-002deee1ae23.dat (invalid GTF/GFF file?)

Is this saying that none of the annotated sequences in my reference GTF file overlap or correspond with those in my Stringtie assembled transcripts?

So far, I’ve checked that the chromosome names are the same in the FASTA and gtf file and I’ve tried using both gtf and gff3 file formats for the reference annotation file.
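For anyone who wants to repeat the chromosome-name check outside Galaxy, here is a minimal Python sketch. It assumes a plain FASTA file and a tab-separated annotation file; the function names are made up for illustration and are not part of any Galaxy or StringTie tool.

```python
def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each FASTA header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_names(lines):
    """Collect the first column (sequence name) of every GTF/GFF data line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def only_in_annotation(fasta_lines, annotation_lines):
    """Return annotation sequence names that have no matching FASTA record."""
    return sorted(annotation_names(annotation_lines) - fasta_names(fasta_lines))
```

An empty result means every sequence name used in the annotation also appears in the FASTA. Both arguments can be open file handles or lists of lines.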

Hi @julsies,
could you share your history with me? I can have a look at it.

Regards

@gallardoalba

Just dropped you a message with a link to my history.

So I’ve also tried using a gff3 file, but I keep getting the error:

Error parsing attribute gene_id ('"' required for GTF) at line:

so I’ve stuck with gtf files, as they have the quotation marks (" ") around strings in the ninth column.
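To illustrate the difference: a GTF ninth column looks like `gene_id "g1"; transcript_id "t1";`, while GFF3 uses unquoted `key=value` pairs such as `ID=gene1;Name=x`. The sketch below is a rough check, not the parser StringTie uses. It reports the first data line whose ninth column lacks a double-quoted `gene_id`. The function name and the regular expression are my own assumptions.

```python
import re

# GTF-style attribute: the key, one space, then a non-empty double-quoted value.
GTF_GENE_ID = re.compile(r'\bgene_id "[^"]+"')


def first_bad_line(lines):
    """Return (line_number, ninth_column) for the first data line that has
    no quoted gene_id attribute, or None if every data line passes."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # ignore headers/comments and blank lines
        fields = line.rstrip("\n").split("\t")
        ninth = fields[8] if len(fields) >= 9 else ""
        if not GTF_GENE_ID.search(ninth):
            return (number, ninth)
    return None
```

A GFF3 line is flagged by this check, and so is a ninth column with doubled quotation marks such as `gene_id ""g1""`.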

Hi @julsies

Try running your GTF annotation through the Stringtie merge tool first (by itself) to adjust formatting. Then, run the Stringtie merge tool again with all of your Stringtie outputs along with the modified GTF.

Prior Q&A with more details:

Why is this needed? The tool expects a slightly modified version of the GTF format. The annotation file included with the tutorial is already adjusted, so the tutorial doesn’t include that step. Avoid GFF3-formatted annotation with this tool and most others, unless a tool form specifically says otherwise. More details are in the topic above, and those usually help with troubleshooting this tool.

One last potential format issue to check for: a common data cleaning/prep step, for any analysis or tool, is removing GTF headers. Strict GTF format does not include headers, only data lines, but some data providers include one or more anyway for practical reasons (versioning, etc.). So, if headers are present in your file, usually at the top and starting with a #, remove them before using the annotation in any analysis step. (Note: GFF3 format will have at least one header line, but that is part of the specification for that datatype.)
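If you prefer to do the header removal locally before uploading, a minimal Python sketch could look like this. The function names and any file paths you pass in are placeholders, and it assumes that header lines are exactly the lines starting with #.

```python
def strip_gtf_headers(lines):
    """Drop header/comment lines (those starting with '#'); keep data lines unchanged."""
    return [line for line in lines if not line.startswith("#")]


def strip_gtf_headers_file(src_path, dst_path):
    """Copy src_path to dst_path without '#' lines. Both paths are placeholders."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_gtf_headers(src))
```

For example, `strip_gtf_headers_file("annotation.gtf", "annotation.noheader.gtf")` would write a header-free copy. Those two file names are only examples.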

If all that still fails, then more is going on, and @gallardoalba can help with the history review :)

Hi @jennaj

Many thanks for your suggestions. I’ve just realized that because I manipulated the gtf file in Excel (I manually converted the gff3 file to a gtf there), the ninth column picked up many extra quotation marks, which became visible once I uploaded the file to Galaxy.

I’d assume this is what caused the following error message when I ran the gtf file by itself through Stringtie merge:

Error: no transcripts were found in input file

Do you have any suggestions on how I can manipulate the original gff3 file into a gtf so that it will be Galaxy-friendly when I upload it?

@jennaj @gallardoalba

I think I’ve solved it!

First, I used TextEdit to remove those pesky extra quotation marks. Uploading to Galaxy confirmed they were gone for good. Then, I ran the gtf file through stringtie-merge as you suggested, and it went through! The downstream steps seem to be working.
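In case it helps someone with a larger file, the same cleanup can be sketched in Python instead of doing it by hand. This assumes the extra quotation marks follow the usual spreadsheet export pattern, where the whole ninth column is wrapped in quotes and every inner quote is doubled, e.g. `"gene_id ""g1""; transcript_id ""t1"";"`. If your file is mangled differently, this will not fix it. The function name is made up for illustration.

```python
import csv


def unquote_spreadsheet_lines(lines):
    """Undo spreadsheet-style quoting in tab-separated lines.

    csv.reader strips the wrapping quotes and collapses doubled quotes;
    fields that do not start with a quote are left exactly as they were.
    Returns the cleaned lines without trailing newlines.
    """
    cleaned = []
    for row in csv.reader(lines, delimiter="\t"):
        if row:  # csv.reader yields an empty list for blank lines
            cleaned.append("\t".join(row))
    return cleaned
```

Lines that are already clean pass through unchanged, so it should be safe to run on a file that is only partly affected. Write the result back out with one line per entry before uploading.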

Thank you both for your time!
