Error in de novo assembly of RNAseq using Trinity

I am trying to do a de novo transcriptome assembly using Trinity, but it results in errors. I searched here for the reason but couldn’t figure out what the problem is. The reads were downloaded from SRA, and I ran FastQC on them to check the quality encoding, which is Sanger/Illumina 1.9, so that seems fine. I used fastp to filter by quality and then Trimmomatic to trim the sequences, resulting in 4 datasets. I used both paired datasets (R1/R2 (paired)) as input for Trinity, but it does not finish and results in this error:

"Sequence: TAAAAAAATCTACASequence: Sequence: ACTGCCTGATACis smaller than 25 base pairs, skipping

Sequence: TTCis smaller than 25 base pairs, skipping
CCAGCis smaller than 25 base pairs, skippingSequence: CTTTATCSequence: TGATTGCCCTis smaller than 25 base pairs, skipping
is smaller than 25 base pairs, skipping
Sequence: AAGATAGGCTTTAAis smaller than 25 base pairs, skipping

is smaller than Sequence: ACTAACTGATAAAAACTAAACis smaller than 25 base pairs, skipping
Sequence: TCATTSequence: Sequence: TGAGTAAGAis smaller than 25 base pairs, skippingGGCis smaller than 25 base pairs, skipping
Sequence: ATATTAGTis smaller than 25 base pairs, skipping
Sequence: CCCTATAis smaller than 25 base pairs, skipping
is smaller than Sequence: AACTAGAATTAACis smaller than 25 base pairs, skipping
Sequence: AATCTCAATAATCTACAis smaller than 25 base pairs, skipping
25 base pairs, skipping
Sequence: ATCATT
Sequence: TCACGATTCAGTCCTGGTCis smaller than 25 base pairs, skipping
is smaller than 2525 base pairs, skipping
Sequence: GGTis smaller than Sequence: AACTTTTAAis smaller than 25 base pairs, skipping
Sequence: base pairs, skipping
Sequence: A"

It does not actually show an error message, it just stopped there, so I’m not sure how to report the problem or look for a solution.

If a job fails you can click that small bug icon on the red colored history “block” to see the error.

This may be related:

The error says:

Dataset Error Report

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/iuc/trinity/trinity/2.9.1+galaxy1 .

Details

Execution resulted in the following messages:

Fatal error: Exit code 2 ()

Detected Common Potential Problems

The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.

I am not sure why duplicated datasets would be a problem, since I used the paired data from the SRA archive.

Tool Parameters

|Input Parameter|Value|
|---|---|
|Are you pooling sequence datasets?|Yes|
|Paired or Single-end data?|paired|
|Left/Forward strand reads|18 Trimmomatic on FASTQ splitter on data 8 (R1 paired)|
|Right/Reverse strand reads|19 Trimmomatic on FASTQ splitter on data 8 (R2 paired)|
|Strand specific data|false|
|Jaccard Clip options|False|
|Run in silico normalization of reads|True|
|additional_params||
|Minimum Contig Length|200|
|Use the genome guided mode?|no|
|Error-corrected or circular consensus (CCS) pac bio reads||
|Minimum count for K-mers to be assembled|1|
|Job Resource Parameters|no|

Hi @andre.sa

Thanks for posting the error report. Would you please send in a bug report now so we can look closer? Please leave all inputs/outputs undeleted and include the URL to this topic post in the comments. How-to: Galaxy Training!

Just did, thanks for the help

Hi @andre.sa

Try uncompressing the fastq reads before running Trinity.

Pencil icon > Convert > uncompress. Do this on each output from the Fastq Splitter tool or on the original dataset 8, then split.

This post has more details about general usage help for Trinity including tutorial links: too much time for start a job? Troubleshooting Trinity inputs + QA/QC steps for RNA-seq assembly - #4 by jennaj

Thanks!

Thanks for the reply. I uncompressed the files from the Trimmomatic output and it didn’t work either (same error as before). I will uncompress the quality-filtered raw file (dataset 8), then run Fastq Splitter (to create the input for Trimmomatic) > Trimmomatic (to retain only paired reads and remove unpaired ones) > Trinity to see if it works, and will post results as soon as I can.

Quick update: the error is still occurring. I tried just unzipping the Trimmomatic output, and also unzipping the reads after filtering > Fastq Splitter > Trimmomatic > Trinity, but the same error shows up:

Dataset Error Report

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/iuc/trinity/trinity/2.9.1+galaxy1 .

Details

Execution resulted in the following messages:

Fatal error: Exit code 2 ()

Detected Common Potential Problems

The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.

Is it the SRA file? I’m not sure how to figure out whether it’s an inherent problem with the dataset or a problem with Trinity itself.

Hi @andre.sa

Thanks for leaving the new failed run undeleted in that same history.

I see the problem (missed it the first time). The original fastq format was interleaved/interlaced. This FAQ covers some ways to convert that format to two datasets (one for forward reads, one for reverse reads): NCBI SRA Fastq - Galaxy Community Hub

There are a few tools to use for this purpose:

  • Manipulate FASTQ reads on various attributes (the tool covered in the FAQ above)
  • FASTQ de-interlacer on paired end reads
  • seqtk_seq common transformation of FASTA/Q (plus a few others in the Seqtk tool group)
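
For anyone who wants to see what "de-interleaving" means outside of Galaxy, here is a minimal Python sketch. It is not how any of the tools above are implemented. It assumes a plain (uncompressed) FASTQ with exactly four lines per record and mates stored back to back (forward, reverse, forward, reverse, ...). The function names `read_fastq` and `deinterleave` are made up for this illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples.

    Assumes a strict 4-lines-per-record FASTQ (no wrapped sequences).
    """
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # empty header line means end of file
            return
        yield tuple(record)


def deinterleave(in_handle, fwd_handle, rev_handle):
    """Write alternating records to forward/reverse handles; return pair count.

    Assumes record 1 is a forward read, record 2 its mate, and so on.
    """
    records = read_fastq(in_handle)
    pairs = 0
    for forward in records:
        reverse = next(records, None)
        if reverse is None:
            raise ValueError("odd number of records: input is not cleanly interleaved")
        fwd_handle.write("\n".join(forward) + "\n")
        rev_handle.write("\n".join(reverse) + "\n")
        pairs += 1
    return pairs
```

With real files you would open one input and two outputs (the file names are placeholders) and call `deinterleave(src, fwd, rev)`. The two outputs are then what Trinity expects as left/forward and right/reverse reads.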

(Trying the tool Fastq Splitter was a good guess, but it doesn’t convert the reads to non-interleaved – instead, it splits each read into two reads. This manipulation is what led to the current errors.)

Paired-end read data can be organized in a few different ways. Unless a tool form specifically supports interleaved fastq as an input option, split interleaved pairs into two files so tools can interpret the data correctly. NGS data logistics
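
If you are unsure how a given FASTQ is organized, a rough check is to compare the names of consecutive reads. The sketch below is only a heuristic, written for illustration. It assumes four lines per record, and that mates share the same first token in the header once a trailing `/1` or `/2` is stripped. Other naming schemes would need a different rule. The names `base_name` and `looks_interleaved` are hypothetical.

```python
import io
import itertools


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-lines-per-record FASTQ."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # empty header line means end of file
            return
        yield tuple(record)


def base_name(header):
    """Return the read name without '@', description, or a /1 or /2 suffix."""
    tokens = header[1:].split()
    name = tokens[0] if tokens else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def looks_interleaved(handle, max_pairs=1000):
    """Heuristic: True if the first records come in same-named consecutive pairs."""
    records = itertools.islice(read_fastq(handle), 2 * max_pairs)
    checked = 0
    for first in records:
        second = next(records, None)
        if second is None:
            return False  # odd number of records, cannot be clean pairs
        if base_name(first[0]) != base_name(second[0]):
            return False
        checked += 1
    return checked > 0
```

A `True` result suggests the file should be split into forward and reverse datasets before it goes into a tool that expects two inputs. A `False` result does not prove the file is single-end, since it may just use a naming scheme this check does not cover.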

Thanks!