I am trying to do a de novo transcriptome assembly with Trinity, but it ends in errors. I searched here for the reason but couldn't figure out what the problem is. The reads were downloaded from SRA, and I ran FastQC on them to check the quality encoding, which is Sanger/Illumina 1.9, so that seems fine. I used fastp to filter by quality and then Trimmomatic to trim the sequences, resulting in 4 datasets. I used both paired datasets (R1/R2 (paired)) as input for Trinity, but it does not finish and gives this output:
"Sequence: TAAAAAAATCTACASequence: Sequence: ACTGCCTGATACis smaller than 25 base pairs, skipping
Sequence: TTCis smaller than 25 base pairs, skipping
CCAGCis smaller than 25 base pairs, skippingSequence: CTTTATCSequence: TGATTGCCCTis smaller than 25 base pairs, skipping
is smaller than 25 base pairs, skipping
Sequence: AAGATAGGCTTTAAis smaller than 25 base pairs, skipping
is smaller than Sequence: ACTAACTGATAAAAACTAAACis smaller than 25 base pairs, skipping
Sequence: TCATTSequence: Sequence: TGAGTAAGAis smaller than 25 base pairs, skippingGGCis smaller than 25 base pairs, skipping
Sequence: ATATTAGTis smaller than 25 base pairs, skipping
Sequence: CCCTATAis smaller than 25 base pairs, skipping
is smaller than Sequence: AACTAGAATTAACis smaller than 25 base pairs, skipping
Sequence: AATCTCAATAATCTACAis smaller than 25 base pairs, skipping
25 base pairs, skipping
Sequence: ATCATT
Sequence: TCACGATTCAGTCCTGGTCis smaller than 25 base pairs, skipping
is smaller than 2525 base pairs, skipping
Sequence: GGTis smaller than Sequence: AACTTTTAAis smaller than 25 base pairs, skipping
Sequence: base pairs, skipping
Sequence: A"
It does not actually show any error message, so I'm not sure how to report the problem or look for a solution. It just stopped there.
An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/iuc/trinity/trinity/2.9.1+galaxy1 .
Details
Execution resulted in the following messages:
Fatal error: Exit code 2 ()
Detected Common Potential Problems
The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.
I am not sure why duplicate datasets would be a problem, since I used the paired data from the SRA archive.
Tool Parameters
|Input Parameter|Value|
|---|---|
|Are you pooling sequence datasets?|Yes|
|Paired or Single-end data?|paired|
|Left/Forward strand reads|18: Trimmomatic on FASTQ splitter on data 8 (R1 paired)|
|Right/Reverse strand reads|19: Trimmomatic on FASTQ splitter on data 8 (R2 paired)|
|Strand specific data|false|
|Jaccard Clip options|False|
|Run in silico normalization of reads|True|
|additional_params||
|Minimum Contig Length|200|
|Use the genome guided mode?|no|
|Error-corrected or circular consensus (CCS) pac bio reads||
|Minimum count for K-mers to be assembled|1|
|Job Resource Parameters|no|
Thanks for posting the error report. Would you please send in a bug report now so we can look closer? Please leave all inputs/outputs undeleted and include the URL to this topic post in the comments. How-to: Galaxy Training!
Thanks for the reply. I uncompressed the files from the Trimmomatic output and it didn't work either (same error as before). Next I will uncompress the quality-filtered raw file (dataset 8) and run Fastq Splitter (to create the input for Trimmomatic) > Trimmomatic (to retain only paired reads and remove unpaired ones) > Trinity to see if it works. I will post the results as soon as I can.
Quick update: the error is still occurring. I tried just unzipping the result from Trimmomatic, and also unzipping the reads after filtering > Fastq Splitter > Trimmomatic > Trinity, but the same error shows up.
Dataset Error Report
An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/iuc/trinity/trinity/2.9.1+galaxy1 .
Details
Execution resulted in the following messages:
Fatal error: Exit code 2 ()
Detected Common Potential Problems
The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.
Is it the SRA file? I'm not sure how I could figure out whether it's an inherent problem with the dataset or a problem with Trinity itself.
Thanks for leaving the new failed run undeleted in that same history.
I see the problem (missed it the first time). The original fastq format was interleaved/interlaced. This FAQ covers some ways to convert that format to two datasets (one for forward reads, one for reverse reads): NCBI SRA Fastq - Galaxy Community Hub
There are a few tools to use for this purpose:
Manipulate FASTQ reads on various attributes (the tool covered in the FAQ above)
FASTQ de-interlacer on paired end reads
seqtk_seq common transformation of FASTA/Q (plus a few others in the Seqtk tool group)
(Trying the tool Fastq Splitter was a good guess, but it doesn't convert the reads to non-interleaved format. Instead, it splits each read into two reads. This manipulation is what led to the current errors.)
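If you want to check outside Galaxy whether a FASTQ file is interleaved, here is a rough Python sketch. It is only a heuristic and makes assumptions that may not hold for every SRA export: each record is exactly four lines, and both mates of a pair share the same read name once any trailing `/1` or `/2` and any text after the first space are dropped. The function names (`read_fastq`, `base_name`, `looks_interleaved`) are made up for this example and are not part of any Galaxy tool.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples, assuming 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def base_name(header):
    """Read name without the leading '@', the description, or a /1 or /2 suffix."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def looks_interleaved(handle, max_pairs=1000):
    """Heuristic: True if consecutive records pair up by read name.

    Only the first `max_pairs` pairs are inspected.
    """
    records = list(islice(read_fastq(handle), 2 * max_pairs))
    if len(records) < 2 or len(records) % 2:
        return False
    return all(
        base_name(first[0]) == base_name(second[0])
        for first, second in zip(records[0::2], records[1::2])
    )


if __name__ == "__main__":
    # Tiny made-up example: one pair stored as two consecutive records.
    demo = io.StringIO(
        "@SRR0.1 1 length=4\nACGT\n+\nIIII\n"
        "@SRR0.1 1 length=4\nTTGA\n+\nIIII\n"
    )
    print(looks_interleaved(demo))
```

If this returns True for the dataset downloaded from SRA, the forward and reverse reads are most likely stored alternately in a single file and need to be separated before assembly.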
Paired-end read data can be organized in a few different ways. Unless a tool form specifically supports interleaved fastq as an input option, split interleaved pairs into two files so tools can interpret the data correctly. NGS data logistics
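To make the idea of splitting interleaved pairs concrete, here is a minimal Python sketch of de-interleaving. It is not how the Galaxy tools above are implemented, just an illustration under these assumptions: records are exactly four lines each, and the file alternates strictly forward, reverse, forward, reverse. The names `read_fastq` and `deinterleave` are hypothetical.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-tuples of lines, assuming 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def deinterleave(interleaved, forward_out, reverse_out):
    """Write records 1, 3, 5, ... to forward_out and 2, 4, 6, ... to reverse_out.

    Each read is kept whole; only its destination file changes.
    Returns the number of pairs written.
    """
    records = read_fastq(interleaved)
    pairs = 0
    for forward in records:
        reverse = next(records, None)  # the mate is the very next record
        if reverse is None:
            raise ValueError("odd number of records; last read has no mate")
        forward_out.write("\n".join(forward) + "\n")
        reverse_out.write("\n".join(reverse) + "\n")
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Made-up two-pair example held in memory instead of real files.
    source = io.StringIO(
        "@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n"
        "@r2/1\nGGCC\n+\nIIII\n@r2/2\nAATT\n+\nIIII\n"
    )
    r1, r2 = io.StringIO(), io.StringIO()
    print(deinterleave(source, r1, r2))
    print(r1.getvalue(), end="")
```

The important difference from a read splitter is that every read stays at its full length. Cutting each read in half, by contrast, produces many very short fragments, which matches the "smaller than 25 base pairs, skipping" messages in the Trinity log.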