StringTie "no reference transcript". Solutions? Or need new alignments?

What I’m trying to do: differential gene expression (DEG) analysis on wheat sequences using HISAT2 -> StringTie -> DESeq2.

The problem: StringTie fails to use my provided GTF file and gives this warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

I’ve noticed several other threads about this, often involving wheat, but with no clear (to me) follow-up solutions.

What I did before this point:

  1. Uploaded the full cDNA FASTA sequence for one chromosome from Ensembl and ran NormalizeFasta as per these instructions. Using the full genome FASTA results in memory failures, and mapping to a single chromosome’s FASTA results in a failure to set metadata, every time.

  2. Ran HISAT2.

  3. Uploaded a GFF3 file, also from Ensembl, that had been converted to GTF format.

  4. Ran StringTie using the HISAT2 BAM file and the GTF file.

My thought is that the problem arises because I have to align to cDNA; otherwise the metadata will not be set. I was hoping StringTie would take the cDNA FASTA gene names from the HISAT2 BAM file, which are exactly the same as the GTF gene_id attribute, and make the connection. But that seems wrong now.

Is it possible to change my GTF file so that StringTie works with it and a BAM file made from a cDNA FASTA file? I’ve tried everything to get the alignment to work with a full chromosome FASTA sequence, but I can’t get it to work without the cDNA file, so I’m at a loss.
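
For anyone hitting the same warning, a quick way to see the mismatch is to compare the reference names in the BAM header with the first column of the GTF. The warning text says these must follow the same naming convention. The sketch below is only an illustration: the header and GTF strings are made-up examples (the IDs and the chromosome name "1A" are assumptions, not taken from my data), and in practice the header text would come from something like `samtools view -H`.

```python
def sam_header_names(header_text):
    """Collect reference names from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def gtf_seqnames(gtf_text):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    for line in gtf_text.splitlines():
        if line and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


# Hypothetical header from an alignment against a cDNA FASTA:
# the reference names are transcript IDs, not chromosome names.
header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:TraesCS1A02G000100.1\tLN:1200\n"
)

# Hypothetical GTF line: column 1 is the chromosome, while the
# transcript ID only appears inside the attributes column.
gtf = (
    '1A\tensembl\texon\t100\t500\t.\t+\t.\t'
    'gene_id "TraesCS1A02G000100"; transcript_id "TraesCS1A02G000100.1";\n'
)

shared = sam_header_names(header) & gtf_seqnames(gtf)
print("BAM references:", sorted(sam_header_names(header)))
print("GTF seqnames:  ", sorted(gtf_seqnames(gtf)))
print("Shared names:  ", sorted(shared))
```

If the shared set is empty, as in this made-up case, the annotation has nothing to attach to: matching values in the gene_id attribute do not help, because the comparison is on the sequence name column.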

Thank you

FYI, this is the solution. Several of the tools in Galaxy don’t seem to work well with “larger” FASTA references, even if no specific error is reported. So for wheat I had to use half of the FASTA sequence of a full chromosome, which gave me a 250 MB file. With this I could successfully assign my data to this reference after normalization.
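
In case it helps, here is a rough sketch of one way the halving could be done outside Galaxy. It assumes the first half of the chromosome is kept, so that the coordinates in the annotation stay valid, and it drops GTF features that extend past the cut. The function names and the tiny example data are hypothetical, and a real chromosome would be read from a file rather than from a string.

```python
def first_half(fasta_text):
    """Return (name, sequence) with the sequence cut to its first half.

    Assumes a single-record FASTA; the name is the header up to the
    first whitespace.
    """
    lines = fasta_text.strip().splitlines()
    name = lines[0][1:].split()[0]
    seq = "".join(lines[1:])
    return name, seq[: len(seq) // 2]


def filter_gtf(gtf_text, name, max_end):
    """Keep GTF features on `name` whose end (column 5, 1-based) is <= max_end."""
    kept = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0] == name and int(cols[4]) <= max_end:
            kept.append(line)
    return kept


# Hypothetical toy chromosome of 16 bases and two annotated exons.
fasta = ">1A dna:chromosome\nACGTACGT\nACGTACGT\n"
gtf = (
    '1A\tensembl\texon\t2\t6\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n'
    '1A\tensembl\texon\t10\t14\t.\t+\t.\tgene_id "g2"; transcript_id "g2.1";\n'
)

name, half = first_half(fasta)
kept = filter_gtf(gtf, name, len(half))
print(name, len(half), len(kept))
```

One caveat with this approach: a transcript that spans the cut point would lose some of its exons, so it may be safer to drop whole transcripts near the boundary.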

@jste

Yes, the wheat genome assembly is very large and will fail for memory reasons on public Galaxy servers. This can happen no matter how you run the job or what resources are allocated. Many tools simply cannot handle the chromosome length when creating indexes, whether pre-computed (native genome) or on-demand (custom genome).

If you are willing to use an alternative version of the Triticum genome that has been re-organized by PLAZA, along with a matched GFF annotation, please see: https://bioinformatics.psb.ugent.be/plaza/. Example: https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v4_5_monocots/organism/view/Triticum+aestivum

There are plans to add all PLAZA genomes to usegalaxy.org but that is still a work-in-progress. You might also consider running your own Galaxy and installing the genome there (indexed for tools). Ticket with more details and links if interested: https://github.com/galaxyproject/usegalaxy-playbook/issues/187

Thanks for posting back what worked, and hopefully this extra info provides more options. Cloudman Galaxy is a popular choice for scientists. Galaxy itself is always free but commercial storage/computation resources are usually not. AWS has always offered simple-to-apply-online grants for research/learning purposes, plus they have recently expanded that program.

I added a few more tags to your post in case that interests you. Full resources can also be found here: