Stringtie merge error

I’m following the de novo transcriptome assembly Galaxy tutorial (Hands-on: De novo transcriptome reconstruction with RNA-Seq / Transcriptomics) with my own data. I’m stuck at the StringTie merge step, where I get the following error message:

Error: could not any valid reference transcripts in /corral4/main/objects/6/5/e/dataset_65e2a2c5-762d-4f88-bec9-002deee1ae23.dat (invalid GTF/GFF file?)

Is this saying that none of the annotated sequences in my reference GTF file overlap or correspond with those in my Stringtie assembled transcripts?

So far, I’ve checked that the chromosome names are the same in the FASTA and gtf file and I’ve tried using both gtf and gff3 file formats for the reference annotation file.

Hi @julsies,
could you share your history with me? I can have a look at it.

Regards

@gallardoalba

Just dropped you a message with a link to my history.

So I’ve also tried using a gff3 file, but I keep getting the error:

Error parsing attribute gene_id ('"' required for GTF) at line:

so I’ve stuck with gtf files, as they have the quotation marks (" ") around strings in the ninth column.
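For anyone wanting to check which attribute style a file uses before uploading, here is a minimal sketch. It only looks at the ninth column of one line; the function name `attribute_style` and the example attribute strings are made up for illustration, and the regular expressions are a rough heuristic rather than a full GTF or GFF3 validator.

```python
import re

# GTF attributes look like:  key "value";
GTF_ATTR = re.compile(r'^\s*\S+ "[^"]*";')
# GFF3 attributes look like: key=value;key=value
GFF3_ATTR = re.compile(r'^\s*[^\s=;]+=[^;]*')


def attribute_style(attr):
    """Guess whether a ninth-column string is GTF-style or GFF3-style."""
    if GTF_ATTR.match(attr):
        return "gtf"
    if GFF3_ATTR.match(attr):
        return "gff3"
    return "unknown"


# Hypothetical attribute strings, not taken from any real annotation.
print(attribute_style('gene_id "g1"; transcript_id "t1";'))
print(attribute_style("ID=gene1;Name=abc"))
```

A result of "unknown" on a file that is supposed to be GTF suggests the ninth column was altered somewhere along the way, for example by a spreadsheet program.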

Hi @julsies

Try running your GTF annotation through the Stringtie merge tool first (by itself) to adjust formatting. Then, run the Stringtie merge tool again with all of your Stringtie outputs along with the modified GTF.

Prior Q&A with more details:

:information_source: Why is this needed? The tool expects a slightly modified version of GTF format. The annotation file included with the tutorial is already adjusted, so that step wasn’t included. Avoid GFF3 formatted annotation with this tool and most others, unless specifically noted otherwise on a tool form. More details are in the topic above, and those usually help to solve troubleshooting issues with this tool.

One last potential format issue to check for: A common item to correct as a data cleaning/prep step (any analysis/tool) is to remove GTF headers. Strict GTF format does not include headers, only data lines, but some data providers include one or more anyway for practical reasons (versioning, etc). So, if headers are present in your file, usually at the top and starting with a #, remove those first before using the annotation in any analysis steps. (Note: GFF3 format will have at least one header line, but that is in the specification for the datatype).
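If you prefer to remove the headers on your own computer before uploading, the step can be sketched in a few lines of Python. This is only an illustration: the function name `strip_gtf_headers` and the two example lines are invented, and it assumes that every header or comment line starts with a #.

```python
def strip_gtf_headers(lines):
    """Keep only data lines: drop lines starting with '#' and blank lines."""
    return [line for line in lines if line.strip() and not line.startswith("#")]


# Hypothetical two-line file: one header line, one data line.
example = [
    "#!genome-build GRCh38\n",
    'chr1\tsource\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
]

cleaned = strip_gtf_headers(example)
print(len(cleaned))
```

To apply it to a real file, read the lines with `open(path)` and write the returned list to a new file, so the original stays untouched.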

If all that still fails, then more is going on and @gallardoalba can help with the history review :slight_smile:

Hi @jennaj

Many thanks for your suggestions. I’ve just realized that because I’ve manipulated the gtf file (I manually converted the gff3 file to a gtf) in Excel, uploading it to Galaxy has caused the ninth column to adopt many extra quotation marks.

I’d assume this is what caused the following error message when I ran the gtf file by itself through Stringtie merge:

Error: no transcripts were found in input file

Do you have any suggestions on how I can manipulate the original gff3 file into a gtf so that it will be Galaxy-friendly when I upload it?

@jennaj @gallardoalba

I think I’ve solved it!

First, I used TextEdit to remove those pesky extra quotation marks. Uploading to Galaxy confirmed they were gone for good. Then, I ran the gtf file through stringtie-merge as you suggested, and it went through! The downstream steps seem to be working.
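In case it helps anyone with a larger file, the same cleanup can be sketched in Python instead of done by hand. This assumes the damage follows the usual spreadsheet export pattern, where the whole ninth column is wrapped in quotation marks and every inner quotation mark is doubled; the exact pattern in my file may have differed, and the function names and example line are made up.

```python
def unquote_excel_field(field):
    """Undo spreadsheet-style quoting: strip the outer quotes, halve doubled ones."""
    wrapped = len(field) >= 2 and field.startswith('"') and field.endswith('"')
    if wrapped and '""' in field:
        return field[1:-1].replace('""', '"')
    return field  # looks undamaged, leave it alone


def fix_gtf_line(line):
    """Repair only the ninth (attribute) column of a tab-separated GTF line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 9:
        cols[8] = unquote_excel_field(cols[8])
    return "\t".join(cols)


# Hypothetical damaged line, as a spreadsheet export might produce it.
damaged = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\t"gene_id ""g1""; transcript_id ""t1"";"'
print(fix_gtf_line(damaged))
```

Lines whose ninth column is already clean pass through unchanged, so the function can be run over the whole file.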

Thank you both for your time!

Hello,

I tried what you suggested and ran my reference gtf through Stringtie as well before going on to the Stringtie merge step, but I am still getting the following error:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

Of course, the original StringTie jobs were run without reference genomes, as I am attempting to assemble a transcriptome de novo, so I don’t understand the “-G” comment.

Additionally, if I continue from here to gffcompare, I get the following error:

No fasta index found for ref_seq.fa.

I thought this program used GTF inputs, not FASTA files?

Please help!

Hi @Kanishk Please explain the steps you have done. If you can incorporate reference annotation at the StringTie Merge step, then it can be incorporated at the StringTie step. The reference annotation needs to be based on the same exact reference genome or transcriptome used for mapping. If not, then the coordinates will not match up and a result like yours is output.

Tutorials: Search Tutorials

Jenna,

I’m following along the de novo transcriptomics tutorial (De novo transcriptome reconstruction with RNA-Seq) with my own data.

As the tutorial details, when constructing de novo transcripts, StringTie should be run without a reference genome. Then these transcripts should be assembled in StringTie merge, this time using a reference genome. This is where I’m starting to have issues. I’ve even reformatted my reference gtf through Stringtie merge to make sure the annotation is identical, but to no avail.

"If you can incorporate reference annotation at the StringTie Merge step, then it can be incorporated at the StringTie step."

I’m not sure the question is whether I can use a reference genome in the Stringtie step. According to what I’m trying to accomplish, I shouldn’t use one here, right? Only starting at the Stringtie merge step.

Also, I used a locally cached hg38 reference genome for the mapping portion (in HISAT2), as suggested by the tutorial, but used a downloaded hg38 gtf file in the Stringtie merge and gffcompare steps later–again as suggested by the tutorial. For the sake of consistency, should I just use the gtf file in my history for HISAT2 and move forward from there? There’s another problem in that case because if I try to use a reference from my history, I can’t seem to select gtf files, only ones in a FASTA format…

Ok, the second reply helps me to understand.

  1. Stringtie – the inputs are mapped reads. Whatever the reads were mapped against is where the coordinates are derived from. All other inputs that are coordinate-based need to be based on that exact same reference genome.

  2. You mapped against the hg38 reference genome at this step in the tutorial.

  3. You can incorporate reference annotation during mapping (HISAT2) and/or transcript assembly (Stringtie) and/or transcript merging (Stringtie merge). The later steps will then also include the reference annotation (differential expression: Featurecounts and DeSeq2 if following the tutorial).

  4. The reference annotation that is first used represents the known genes/transcripts for your genome. You will still capture novel transcripts if you use one. The different tool forms will state the choices: create transcripts that are known (only – those included in the annotation) OR create known plus novel. You can run it both ways in different tools and compare.

  5. As you move through the analysis, the reference annotation will be updated and reformatted at certain steps. The updated version at an earlier step is what you want to use in the next downstream steps.

  6. The hg38 reference annotation GTF is now best sourced from UCSC. Not from the Table Browser but from the Downloads area here. There are a few Gene tracks to choose from. You can review what each of those represents then upload the selected GTF to Galaxy with the Upload tool. Just copy and paste in the URL and leave all other settings at the default. The correct datatype will be assigned.

I’m not sure what that annotation represents or where you sourced it, but some data providers create GTF data that are not quite in the correct format. It is also possible you have annotation with chromosome identifiers that are a mismatch for hg38. I would stick with the UCSC GTF if this analysis is new to you – nothing special needs to be done to prepare it for use with tools. But if you really want to use another source, Gencode is probably the best alternative. You’ll need to remove the header lines and then update the datatype first. Instructions are in this prior Q&A.
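A quick way to test for a chromosome identifier mismatch outside of Galaxy is to compare the first column of the GTF with the sequence names in the genome FASTA. The sketch below is only an illustration: the function names and the tiny example inputs are made up, and it assumes you have a local copy of the same genome FASTA that was used for mapping.

```python
def gtf_seqnames(lines):
    """Collect the first column (sequence name) from GTF data lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def fasta_seqnames(lines):
    """Collect sequence names from FASTA header lines (text after '>' up to whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def missing_from_genome(gtf_lines, fasta_lines):
    """Return GTF sequence names that do not occur in the FASTA, sorted."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines))


# Hypothetical inputs: annotation names chromosome 1 as "1", the genome as "chr1".
gtf = ['1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n']
fasta = [">chr1 some description\n", "ACGT\n"]
print(missing_from_genome(gtf, fasta))
```

Any name in the returned list is a sequence the tools cannot match to the genome. If every annotation name is listed, the two files use different naming conventions (for example "1" versus "chr1").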

Note: you may find prior Q&A that specifically states to avoid the UCSC GTF. That was about using the version from the Table Browser before UCSC started generating properly formatted GTF data available in the Downloads area for all of their genomes that had a Gene annotation track.

In my opinion, the version of hg38 reference annotation in the UCSC Downloads area is now absolutely the best choice. Especially if you choose the RefSeq Gene annotation (hg38.ncbiRefSeq.gtf.gz) – as that one is updated regularly, about monthly. Other gene tracks are fixed at prior releases. Other data sources may need reformatting and some are missing important attributes. But you can review and decide. Or maybe run distinct analysis paths with the different GTF choices and compare results. However you do this – start with a specific annotation GTF then keep using that throughout the same analysis. Meaning: don’t switch from RefSeq to Ensembl in the middle of an analysis, or you should expect problems.

For this, my guess is that you are mixing up where to input a reference annotation (GTF) versus a reference genome or reference transcriptome (fasta) on the tool forms. Your analysis will include all three.

  • A reference annotation will be selected from the history for your use case. Featurecounts does have annotation available for a few genomes, including hg38, but that is for when you don’t need annotation for any other steps – and it won’t match external sources – so don’t mix that in for now.
  • A reference transcriptome will be selected from the history (always).
  • A reference genome for your case is the already indexed built-in version of hg38. Technically you can input a reference genome from the history, too, but that only works for very small genomes and there is no need if Galaxy has already indexed it.

That should cover all the questions/problems you were having. Please try this out :slight_smile:
