Error with Trinity De Novo Assembly in GVL 4.4.0 install

We’ve been experiencing the persistent error pasted below with our lab’s cloud-based installation of the Galaxy suite via the Genomics Virtual Laboratory (GVL) 4.4.0. Specifically, whenever we attempt a Trinity de novo assembly of a collection of Illumina TruSeq stranded mRNA paired-end reads, we get the errors below. We’ve had no problems with these same datasets when using other tools like HISAT2 and StringTie. We’ve poked around in the “Manage tools” and “Manage dependencies” areas of the Galaxy Admin interface, updated to the latest version of Trinity, and checked dependencies, and all seems to be OK; however, we must confess to being rank amateurs. Can anyone point us toward some other things we might try?

Fatal error: Exit code 2 ()
Possible unintended interpolation of @2 in string at
/mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/bin/Trinity line 46.

Tool Errors

Internal Tool Error log

Time Phase File Error
2019-02-14 03:03:05.105596 Tool Loading ./tools/phenotype_association/sift.xml [Errno 2] No such file or directory: ‘/mnt/galaxy/galaxy-app/tool-data/sift_db.loc’
2019-02-14 03:03:05.104204 Tool Loading ./tools/evolution/add_scores.xml [Errno 2] No such file or directory: ‘/mnt/galaxy/galaxy-app/tool-data/add_scores.loc’
2019-02-14 03:03:05.103428 Tool Loading ./tools/evolution/codingSnps.xml [Errno 2] No such file or directory: ‘/mnt/galaxy/galaxy-app/tool-data/codingSnps.loc’
2019-02-14 03:03:05.100147 Tool Loading ./tools/extract/liftOver_wrapper.xml ‘liftOver’

Also found additional error information via the Galaxy | Reports interface:

> Error, 1 threads errored out at /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/opt/trinity-2.8.4/util/ line 963.
> Error, cmd: /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/opt/trinity-2.8.4/util/ --seqType fq --JM 1G  --max_cov 200 --min_cov 1 --CPU 4 --output /mnt/galaxy/tmp/job_working_directory/000/104/working/trinity_out_dir/insilico_read_normalization   --max_pct_stdev 10000  --SS_lib_type FR  --left /mnt/galaxy/files/000/dataset_64.dat,/mnt/galaxy/files/000/dataset_67.dat,/mnt/galaxy/files/000/dataset_70.dat,/mnt/galaxy/files/000/dataset_73.dat --right /mnt/galaxy/files/000/dataset_65.dat,/mnt/galaxy/files/000/dataset_68.dat,/mnt/galaxy/files/000/dataset_71.dat,/mnt/galaxy/files/000/dataset_74.dat --pairs_together --PARALLEL_STATS   died with ret 6400 at /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/bin/Trinity line 2689.
> 	main::process_cmd("/mnt/galaxy/tool_dependencies/_conda/envs/__trinity\@2.8.4/opt"...) called at /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/bin/Trinity line 3235
> 	main::normalize("/mnt/galaxy/tmp/job_working_directory/000/104/working/trinity"..., 200, ARRAY(0x55a0498484e8), ARRAY(0x55a0498484d0)) called at /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/bin/Trinity line 3182
> 	main::run_normalization(200, ARRAY(0x55a0498484e8), ARRAY(0x55a0498484d0)) called at /mnt/galaxy/tool_dependencies/_conda/envs/__trinity@2.8.4/bin/Trinity line 1319

This looks like a tool wrapper bug (conda path quoting): judging from the message, Perl appears to be interpolating the @2 in the conda environment path __trinity@2.8.4 as an array inside a double-quoted string. It may be present in some Trinity tool wrapper versions and not in others.

GVL 4.4.0 was released in September 2018 and is based on Galaxy release 18.05. The most recent version of the Trinity wrapper from the IUC was released in October 2018. The most recent Galaxy release is now 19.01.

I suggest updating the Trinity wrapper to the most current IUC version and testing that out. If the latest version still exhibits the problem, please write back and we can help get a ticket opened to review the issue, learn the (actual, not guessed) root cause, and get it fixed.

Since we are newbies and rely on the GVL interface for an easy Galaxy deployment in AWS, we were stuck with Galaxy release 18.05, as the website isn’t offering 19.01. That said, I updated the version of Trinity running in our instance to 2.8.4 (22; 2018-10-17), checked and corrected issues with the tool dependencies, and uninstalled the older versions of Trinity. This seemed to fix most of the problems we were having. We are still running into memory limitations for some datasets, since we are limited in the size of virtual machine we can use in the regions where GVL 4.4.0 images are automatically deployed, but we are trying to figure out ways to downsample our RNA-seq data to get around this. We appreciate you reaching out!


Glad the updated version is working, or mostly working!

And I would agree, downsampling is probably the best option if memory resources are a factor.

Take care

Can you recommend a best-practice tool for downsampling, or point us to a web resource describing the generally preferred approaches? It would save us a lot of trial-and-error time.

Many thanks!


Random sampling is almost always the best choice for expression data like RNA-seq reads.

There are a few ways to get a random sample. The tool seqtk_sample is straightforward to use; if it is not on your server yet, the Seqtk repository is available in the ToolShed.

The simplest path would be to subsample one of the inputs (for example, the R1 reads), then use a tool like Trimmomatic, or the pair Fastq Interlacer/Deinterlacer, to re-pair the reads and end up with two matched paired-end R1/R2 datasets.

The tool seqtk_subseq will subset fastq datasets based on an extracted list of IDs, but that involves more steps, including sequence-identifier transformations. Alternatively, the pairs could be interlaced first (if they are not already), a random sample generated with seqtk_sample or seqtk_seq, any unmatched mates removed with seqtk_dropse, and then the result split back with two more rounds of seqtk_seq to end up with two paired datasets (R1 + R2).
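If you ever need to do this outside Galaxy, the key trick behind all of these approaches is to sample both mate files with the same random seed so the surviving pairs stay matched (this is what seqtk sample does when you pass the same -s seed for R1 and R2). Here is a minimal Python sketch of that idea, assuming plain uncompressed 4-line FASTQ records and identically ordered mate files; the function name and file paths are illustrative, not from any particular tool:

```python
import random

def sample_fastq(path, fraction, seed):
    """Randomly keep ~`fraction` of the 4-line FASTQ records in `path`.

    Using the same seed for the R1 and R2 files keeps the surviving
    pairs matched, provided both files list reads in the same order:
    the i-th record survives in R1 exactly when it survives in R2.
    """
    rng = random.Random(seed)
    kept = []
    with open(path) as fh:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                kept.append("".join(record))
    return kept

# Same seed for both mates -- analogous to:
#   seqtk sample -s100 reads_R1.fq 0.25 > sub_R1.fq
#   seqtk sample -s100 reads_R2.fq 0.25 > sub_R2.fq
# r1 = sample_fastq("reads_R1.fq", 0.25, seed=100)
# r2 = sample_fastq("reads_R2.fq", 0.25, seed=100)
```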

Doing some QA/QC first will make sure that only high-quality reads remain in your data before downsampling; you could also do this afterward, or as part of the processing (for example: FastQC > Trimmomatic > FastQC).

The important part is that Trinity requires two matched inputs per paired sample (R1 reads in one, R2 reads in the other). It does not accept interlaced/interleaved fastq data.
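For reference, a direct command-line Trinity run has the same shape as the failing command in the error log above; the file names here are placeholders, and the flags are taken from that log (plus Trinity 2.x’s top-level --max_memory cap):

```shell
# Sketch of a direct Trinity invocation (placeholder file names).
# Two matched files per paired sample: R1 mates via --left, R2 mates
# via --right, in the same order -- never interlaced/interleaved fastq.
Trinity --seqType fq \
        --left  sampleA_R1.fq,sampleB_R1.fq \
        --right sampleA_R2.fq,sampleB_R2.fq \
        --SS_lib_type FR \
        --CPU 4 \
        --max_memory 4G \
        --output trinity_out_dir
```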

You can decide the best path; a longer combination of tools could always be used in a workflow that deletes intermediate datasets, etc., to manage data duplication and storage space.