Trinity genome-guided assembly produces empty files despite the job being completed successfully

Trinity genome-guided assembly produces empty files even though the job completes successfully. I also got the same unexpected result on galaxy.org.


Hi @Lukasz_Gajda

Are you still having trouble? If so, it would be helpful for you to provide more information.

If possible, please create a share link to the history, paste that back here as a reply, and note which datasets are involved. Please avoid deleting any of the inputs or outputs.

An alternative to start with could be the Dataset information page associated with one of the empty outputs, and each of the inputs. This includes the tool, input, and parameter details. Click on the :information_source: icon with a dataset to find it. Copy the text and paste it back please.

Thanks!

Dear @jennaj,

Thank you very much for contacting me. Firstly, I noticed that I apparently used an unsorted bam file in my last attempt, which is bad of course. However, I also obtained empty files in the previous try with a Samtools-sorted bam. De novo assembly works fine for me. I must admit that the genome-guided assembly with Trinity is something new for me.

Link to the history:

https://usegalaxy.eu/u/ruptawsky/h/unnamed-history

Previous attempt:

6: Trinity on data 4, data 2, and data 1: Assembled Transcripts
Note: input bam file was prepared with STAR in OmicsBox and sorted with Samtools in Galaxy

The latest attempt:

27: Trinity on data 26, data 2, and data 1: Assembled Transcripts
Note: input bam file was prepared with STAR in Galaxy

I am grateful for your support! If you need more details, feel free to ask.

Hi @Lukasz_Gajda

It looks like you deleted the empty outputs, so I couldn’t review the exact parameters you used, but I could guess the inputs and reviewed those.

The tool is probably having trouble mapping the IDs in the GTF to the custom genome’s IDs. That can lead to the removal of assemblies from the output (why? no annotated splices). This can be addressed by cleaning up the fasta and GTF formatting.

The custom genome fasta itself is a bit large (numerous sequences), but try the reformatting, then a rerun using those modified inputs. If you get an error or odd results, please leave those results undeleted and we can take another look.
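To make the identifier point concrete, here is a minimal sketch (not part of any Galaxy tool) of how one could check whether the sequence names in a GTF match the IDs in a fasta. The function names and the toy data are made up for illustration, and it assumes the fasta ID is the first whitespace-delimited token on each `>` line.

```python
def fasta_ids(fasta_text):
    """Return the set of sequence IDs (first token after '>') in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gtf_seqnames(gtf_text):
    """Return the set of column-1 values (seqname) in GTF/GFF text, skipping comments."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def unmatched_ids(fasta_text, gtf_text):
    """Seqnames used in the annotation that have no matching FASTA record."""
    return sorted(gtf_seqnames(gtf_text) - fasta_ids(fasta_text))


# Toy data, invented for this example only
fasta = ">chr1 some long description\nACGT\n>chr2\nGGCC\n"
gtf = (
    'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'chrUn\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)
print(unmatched_ids(fasta, gtf))  # ['chrUn']
```

An empty list would suggest the identifiers line up; anything listed is an annotation sequence name that the genome fasta does not contain.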

Dear @jennaj,

I’m not sure, but I think we are talking about different things here. Sorry if I misunderstood you. Which tool are you talking about? Do you mean I messed something up with the input bam file, i.e., at the RNA-seq alignment step with STAR? In order to run a Trinity assembly in genome-guided mode, you must provide read alignments to Trinity as a coordinate-sorted bam file. In that case, your input files for Trinity are: RNA-seq reads (trimmed and filtered) + bam file (RNA-seq reads aligned to the reference genome with STAR and sorted with Samtools). I did not delete any of Trinity’s results. Trinity produces Assembled transcripts and its own Gene to Transcripts map. These are:

  • for previous attempts: 7 Trinity on data 4, data 2, and data 1: Gene to transcripts map AND 6 Trinity on data 4, data 2, and data 1: Assembled Transcripts

  • for the latest try: 28 Trinity on data 26, data 2, and data 1: Gene to transcripts map AND 27 Trinity on data 26, data 2, and data 1: Assembled Transcripts.

For RNA-seq alignment with STAR, I used my Cyprinus carpio RNA-seq reads, while the reference genomic fasta and GFF files were downloaded from GenBank.
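As a side note on the coordinate-sorted requirement, a quick way to see what sort order a bam declares is to look at the `SO:` tag on the `@HD` header line (the header text can be printed with `samtools view -H`). The sketch below only parses header text that has already been obtained; the function name and the example header are made up for illustration, and a declared sort order is not proof that the records really are sorted.

```python
def bam_sort_order(header_text):
    """Return the SO: value from the @HD line of a SAM/BAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Example header text, invented for this illustration
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
print(bam_sort_order(header))  # coordinate
```

A value of `unsorted` or `queryname`, or no `@HD` line at all, would be a hint to re-sort before trying genome-guided mode.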

Thanks again,

These are the two inputs that may have had (technical) issues. By default, files from that source are not in a strict format, and some tools produce odd results when the identifiers cannot be matched up. I always recommend cleaning up data to eliminate easy problems, but not everyone does :slight_smile:

Also, I reloaded the history and can see the empty Trinity results now. Changing the first item might be enough. If not, consider the other two.

  1. Use a genomic mapping BAM result, not the transcriptome.
  2. Simplify the reference genome. Omitting the unplaced scaffolds to maximize unique hits (scientific result quality) could be tested/compared. NCBI reports BUSCO duplication of around 60%. That seems high, but I don’t know whether it is expected for this species.
  3. Simplify the reads. If either end is shorter than 25 bases, both ends are sorted out by Trinity. The job logs list many of these. May not matter but could be pre-filtered to shorten the runtime.
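For the third item, the pre-filter could look something like the sketch below. It is not what Trinity runs internally; it just drops a pair when either mate is shorter than 25 bases, as described above. The function names are made up, and it assumes plain 4-line FASTQ records with both files in the same read order.

```python
def read_fastq(text):
    """Yield (header, seq, plus, qual) tuples from 4-line FASTQ text."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines) - 3, 4):
        yield tuple(lines[i:i + 4])


def filter_pairs(fq1_text, fq2_text, min_len=25):
    """Keep only pairs where BOTH mates have at least min_len bases."""
    kept1, kept2 = [], []
    for r1, r2 in zip(read_fastq(fq1_text), read_fastq(fq2_text)):
        if len(r1[1]) >= min_len and len(r2[1]) >= min_len:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2


# Toy paired reads, invented for this example: pair "b" has a short mate 2
fq1 = "@a/1\n" + "A" * 30 + "\n+\n" + "I" * 30 + "\n@b/1\n" + "C" * 30 + "\n+\n" + "I" * 30 + "\n"
fq2 = "@a/2\n" + "G" * 30 + "\n+\n" + "I" * 30 + "\n@b/2\n" + "T" * 10 + "\n+\n" + "I" * 10 + "\n"

kept1, kept2 = filter_pairs(fq1, fq2)
print([r[0] for r in kept1])  # ['@a/1']
```

On real data a dedicated trimming tool with a minimum-length option would be the more practical route; the point here is only the "both ends go if either is short" logic.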

The job logs included the lines below. They mean that the transcriptome bam didn’t fully parse with samtools. If the genomic bam also fails at this step, then definitely consider using primary chromosomes only.

Tuesday, September 6, 2022: 08:37:15 CMD: /usr/local/tools/_conda/envs/mulled-v1-8bfd939b6449b852af28c673a66720972cbc5543a7cc8d50ff9c6e1f04709fb3/opt/trinity-2.9.1/util/support_scripts/ensure_coord_sorted_sam.pl localbam.bam
-appears to be a coordinate sorted bam file. ok.
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
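If you do try primary chromosomes only, a sketch of the subsetting step is below. The function name is made up, and the `NC_` prefix in the example is only an assumption about how chromosome records might be named; check the actual headers in your fasta and adjust the prefixes to match.

```python
def keep_records(fasta_text, wanted_prefixes):
    """Return FASTA text containing only records whose ID starts with a wanted prefix."""
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # str.startswith accepts a tuple of alternatives
            keep = seq_id.startswith(tuple(wanted_prefixes))
        if keep:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


# Toy genome, invented for this example (IDs and prefixes are assumptions)
genome = (
    ">NC_000001.1 chromosome 1\nACGT\nTTAA\n"
    ">NW_000002.1 unplaced scaffold\nGGGG\n"
)
print(keep_records(genome, ["NC_"]), end="")
```

The annotation would need the same subsetting so that it only refers to sequences still present in the reduced fasta, and the reads would need to be re-mapped against it.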

Hope that helps!

Hi @jennaj
After further investigation of the issue, I believe that the current Galaxy version of Trinity has a bug.
In order to test Trinity genome-guided mode assembly on Galaxy, I used pre-prepared files from the OmicsBox tutorial, i.e.,
Fastq reads (single-end data) from Neisseria gonorrhoeae (SRR3666079) and sorted BAM file (BWA RNA-seq alignment of the SRR3666079 reads with N. gonorrhoeae reference genome - Neisseria_gonorrhoeae_fa_1090.ASM684v1.dna.toplevel.fa). All files were downloaded from:
http://manual.omicsbox.biobam.com/example-datasets/transcriptomics/gene-level-analysis/#Gene-LevelAnalysis-1-RNA-SeqAlignment
The resulting Trinity file that should contain assembled transcripts was empty again (with the same warning: “samtools view: error closing standard output: -1”). I also obtained an empty result with another run of my Cyprinus carpio data (with a genomic mapping (sorted) BAM result this time).
In fact, the same bug (‘Trinity guided mode output file is not captured’) was reported for Trinity and fixed in version 2.13.2.

References for the bug:

Link to history:
https://usegalaxy.eu/u/ruptawsky/h/trinity-genome-guided-test

Feel free to correct me if I missed something.
I could be wrong in this, but are there any plans to upgrade the Trinity version on Galaxy?
Thank you for your support.
P.S. Common carp (Cyprinus carpio) is an allotetraploid, so the duplication level is as expected.

Hi @Lukasz_Gajda

Thanks for all of the follow up!

The problem when running this in Galaxy is actually the wrapper (not the underlying tool version). Ticket: galaxyproject/tools-iuc issue #3850 on GitHub, “Trinity: Bug in output collection when using genome-guided mode”. I’ve also asked the Tools WG to take another look. If you want to help make the changes, please comment on the ticket and reach out to the Tools working group for coordination.

I thought it was fixed already, but it wasn’t – sorry! I’ve added your test history to the ticket, along with a few more manipulations/tests done by me. If you want to purge your copy to recover storage space in your account, that would be OK now.