Trimmomatic error: Fatal error

Hi, I have tried running some RNA-seq datasets through Trimmomatic, but they keep failing. The error I get when clicking on the bug icon is the following:

Fatal error:

When checking the information section, I get “empty” for “Tool standard error”, and “Tool standard output” is as below:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/046/864/46864652/_job_tmp -Xmx28g -Xms256m
TrimmomaticSE: Started with arguments:
-threads 6 fastq_in.fastqsanger.gz fastq_out.fastqsanger.gz ILLUMINACLIP:/corral4/main/jobs/046/864/46864652/configs/tmptcibpyu4:2:30:10 LEADING:30 TRAILING:30
Using Long Clipping Sequence: 'ATCGGAAGAGCACACGTCTGAACTCCAGTCACGGTGAACCATCTCGTATG'
Using Long Clipping Sequence: 'GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTCCAACGCGTGTAGATCT'
Using Long Clipping Sequence: 'ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTCCAACGCGTGTAGATCTC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGTGAACCATCTCGTAT'
ILLUMINACLIP: Using 0 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at org.usadellab.trimmomatic.util.ConcatGZIPInputStream.read(ConcatGZIPInputStream.java:73)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at org.usadellab.trimmomatic.fastq.FastqParser.parseOne(FastqParser.java:71)
at org.usadellab.trimmomatic.fastq.FastqParser.next(FastqParser.java:179)
at org.usadellab.trimmomatic.threading.ParserWorker.run(ParserWorker.java:42)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-0" java.lang.RuntimeException: java.io.EOFException: Unexpected end of ZLIB input stream
at org.usadellab.trimmomatic.threading.ParserWorker.run(ParserWorker.java:56)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at org.usadellab.trimmomatic.util.ConcatGZIPInputStream.read(ConcatGZIPInputStream.java:73)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at org.usadellab.trimmomatic.fastq.FastqParser.parseOne(FastqParser.java:71)
at org.usadellab.trimmomatic.fastq.FastqParser.next(FastqParser.java:179)
at org.usadellab.trimmomatic.threading.ParserWorker.run(ParserWorker.java:42)
… 1 more
Input Reads: 17849000 Surviving: 17595136 (98.58%) Dropped: 253864 (1.42%)
TrimmomaticSE: Completed successfully

What I was trying to do was run a fastq.gz file and trim the beginning and end bases below a certain quality score (using the LEADING and TRAILING options). I was also trying to trim the adapter sequences shown as overrepresented in the FastQC report, entering the specific FASTA sequences shown in the report.

I tried other trimming tools (Trim Galore! and Cutadapt), but they also resulted in errors (“Fatal error: Exit code 1 ()” and “Fatal error: Exit code 2 ()” respectively). FastQC works fine.

If you could help me discover what went wrong, I would really appreciate it. I’m a total newbie, and this is something I have been struggling with for the last couple of days.

I think this mostly has to do with corrupted files or a wrong FASTQ format. If you look up “Trimmomatic java.io.EOFException: Unexpected end of ZLIB input stream” you will find some results that might help. On Galaxy there is also a tool called FASTQ info; maybe this can help too (I am not familiar with this tool).
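If you want to check for that locally before re-uploading, below is a minimal Python sketch (the file name is hypothetical) that does roughly what `gzip -t` does: decompress the file to the end and report whether the stream is complete. A truncated download raises Python’s counterpart of Trimmomatic’s “Unexpected end of ZLIB input stream”.

```python
import gzip
import zlib

def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file decompresses cleanly to the end.

    A truncated .gz raises EOFError ("Compressed file ended before the
    end-of-stream marker was reached"); corrupted data raises
    zlib.error or gzip.BadGzipFile.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False

print(gzip_is_intact("SRR0000000.fastq.gz"))  # hypothetical file name
```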


A possibility. I’ve had issues with the original file downloads from NCBI SRA (some files would end up truncated, so I had to re-download them). But then FastQC worked. What method do you usually use to download the datasets and check their integrity?
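One way to check integrity, assuming the archive publishes per-file checksums (ENA’s file reports include a fastq_md5 column, for example), is to compare an MD5 computed locally against the published value. A rough sketch; the file name and expected digest below are made up:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks so large FASTQs don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "9e107d9d372bb6826bd81d3542a419d6"     # hypothetical published value
print(md5sum("SRR0000000_1.fastq.gz") == expected)  # hypothetical file
```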


Either or both of those tools can be run to detect truncated datasets. tutorials/quality-control

You didn’t state how you were loading the data, but SRA data from NCBI is best retrieved by accession ID (one or a list) using the tool Faster Download and Extract Reads in FASTQ. The results will be sorted into collections with the proper format/datatype.
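As far as I know, this Galaxy tool wraps fasterq-dump from NCBI’s sra-tools. For anyone mirroring the same retrieval path outside Galaxy, a rough sketch (the accession is hypothetical, and the flags are the sra-tools 3.x ones as I recall them):

```python
import subprocess

def fetch_sra(accession, outdir="reads", threads=6):
    """Download an SRA run and convert it to FASTQ with sra-tools.

    prefetch fetches and verifies the .sra archive; fasterq-dump then
    writes _1/_2 files for paired-end runs (its default split mode).
    """
    subprocess.run(["prefetch", accession], check=True)
    subprocess.run(
        ["fasterq-dump", accession,
         "--outdir", outdir,
         "--threads", str(threads)],
        check=True,
    )

fetch_sra("SRR0000000")  # hypothetical accession
```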


An update.

So I tried the FASTQ info tool as per gbbio’s suggestion. It indeed showed that some of the datasets I was not able to trim were truncated or otherwise affected (something about sequence and quality not having the same length), despite the FastQC reports looking alright. Many of the FASTQ info reports also gave a “Read name provided with no suffix” message. Some got only partial reports, with messages about running out of memory.

So I assume these datasets are somehow damaged even if they looked alright in FastQC reports? :confused:

@jennaj I downloaded the datasets using Google Chrome: going to the NCBI SRA Run Selector, clicking on individual SRR numbers, and then “FASTA/FASTQ download”. I then uploaded them manually, one by one, from my computer onto Galaxy and ran the FastQC tool on each, which was extremely inefficient time-wise.

I have now tried running Faster Download and Extract Reads in FASTQ for one of the datasets. I see that it generated paired-end data files, which is interesting: the previous download method resulted in fastq.gz files that I was able to run only as single-end reads in Trimmomatic and the alternative tools, even though they were labelled as “paired” in the SRA database.

I guess my questions are:

  1. Would you recommend redownloading the datasets of interest using this Faster Download and Extract Reads in FASTQ tool and re-running the FastQC analyses/trimming?
  2. Would you recommend running additional quality validation via FastQ info?
  3. Is Faster Download and Extract Reads in FASTQ something that would give the best balance of speed and reliability (in terms of dataset integrity) when downloading large datasets?

I’ve been trying to investigate alternative download/upload methods, but I understand FTP downloads/uploads are no longer supported by Galaxy?

Would highly appreciate your advice.


Yes to all of these.

FastQC only reads in a subset of the total reads, so I guess it could miss truncated data. That seems new and unexpected if the input was compressed. If you want to send in an example as a bug report directly from an error, I’d be interested in reviewing it.

FastQ info checks everything (a sketch of the kinds of per-record checks involved is just below).

Faster Download and Extract Reads in FASTQ is a dedicated tool intended for batch data transfers, on both Galaxy’s end and NCBI’s.
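To make the FastQ info point concrete, here is a minimal sketch of the kinds of per-record checks a FASTQ validator performs. This is an illustration, not Fastq info’s actual code, and the file name is hypothetical; it covers the two failure modes from this thread, an ID line that doesn’t start with ‘@’ and sequence/quality length mismatches.

```python
import gzip

def fastq_problems(path):
    """Yield (line_number, message) for malformed FASTQ records.

    A truncated .gz will still raise EOFError partway through the read,
    which is itself a sign the file is damaged.
    """
    with gzip.open(path, "rt") as handle:
        lineno = 0
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of input
            seq, plus, qual = (handle.readline() for _ in range(3))
            lineno += 4
            if not header.startswith("@"):
                yield lineno - 3, "ID line didn't start with '@'"
            if not plus.startswith("+"):
                yield lineno - 1, "separator line didn't start with '+'"
            if not qual:
                yield lineno, "record truncated (missing quality line)"
                break
            if len(seq.rstrip("\n")) != len(qual.rstrip("\n")):
                yield lineno, "sequence and quality lengths differ"

for line, message in fastq_problems("SRR0000000.fastq.gz"):  # hypothetical
    print(f"line {line}: {message}")
```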

When navigating NCBI directly for links and pasting those into the Upload tool, some of the metadata is lost and/or the link might be to the “submitted” data version rather than the “processed” version. The differences are technical, but most people want the “processed” version, and that is what the Faster tool retrieves.

Partial data loads should be rare-ish with any method, but no one ever regrets running a bit more QA and avoiding problems downstream :slight_smile: That said, NCBI does have super busy times when any data retrieval, or sometimes specific types of data retrieval, will fail. The solution is to wait a bit and then try again.

For large amounts of data, it can help to get things organized at the Upload step. These specific tutorials cover batch concepts and methods.


> Yes to all of these.

Wonderful.

> FastQC only reads in a subset of the total reads, so I guess it could miss truncated data. That seems new and unexpected if the input was compressed. If you want to send in an example as a bug report directly from an error, I’d be interested in reviewing it.
> FastQ info checks everything.

Thank you, that explains a lot.

As for the bug reports: below is the tool standard error for the FastQC output of the dataset:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/046/559/46559587/_job_tmp -Xmx28g -Xms256m
java.io.EOFException: Unexpected end of ZLIB input stream
at java.base/java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:245)
at java.base/java.util.zip.InflaterInputStream.read(InflaterInputStream.java:159)
at java.base/java.util.zip.GZIPInputStream.read(GZIPInputStream.java:118)
at uk.ac.babraham.FastQC.Utilities.MultiMemberGZIPInputStream.read(MultiMemberGZIPInputStream.java:68)
at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:138)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:77)
at java.base/java.lang.Thread.run(Thread.java:834)

I see now that it has the same “Unexpected end of ZLIB input stream” issue. But the FastQC output reports came up green, without any bug icons flagged (for both the Webpage and RawData outputs) or anything else that would have suggested truncation. Just to verify: if I run Trimmomatic (or something else) on this dataset, this report stays as it is until I run FastQC on the trimmed dataset, is that correct?

This was my first run that worked, so I trusted green until the trimming step :frowning:

The FastQ info report also came back green, without bugs. The tool standard error was empty, but this is what the output looked like:

fastq_utils 0.25.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from /corral4/main/objects/9/4/8/dataset_94835bb6-102b-4cdf-9484-a19ea90a0d6c.dat
Read name provided with no suffix
10000020000030000040000050000060000070000080000090000010000001100000120000013000001400000150000016000001700000180000019000002000000210000022000002300000240000025000002600000270000028000002900000300000031000003200000330000034000003500000360000037000003800000390000040000004100000420000043000004400000450000046000004700000480000049000005000000510000052000005300000540000055000005600000570000058000005900000600000061000006200000630000064000006500000660000067000006800000690000070000007100000720000073000007400000750000076000007700000780000079000008000000810000082000008300000840000085000008600000870000088000008900000900000091000009200000930000094000009500000960000097000009800000990000010000000101000001020000010300000104000001050000010600000107000001080000010900000110000001110000011200000113000001140000011500000116000001170000011800000119000001200000012100000122000001230000012400000125000001260000012700000128000001290000013000000131000001320000013300000134000001350000013600000137000001380000013900000140000001410000014200000143000001440000014500000146000001470000014800000149000001500000015100000152000001530000015400000155000001560000015700000158000001590000016000000161000001620000016300000164000001650000016600000167000001680000016900000170000001710000017200000173000001740000017500000176000001770000017800000
ERROR: Error in file /corral4/main/objects/9/4/8/dataset_94835bb6-102b-4cdf-9484-a19ea90a0d6c.dat: line 71398384: file truncated

The numbers did not come up as one block in the report, but were separated by series of squared question marks (15 between each number).
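For what it’s worth, those squared question marks are very likely control characters: the tool seems to print a running read counter and then backspace over it, which looks fine in a terminal but turns into unprintable bytes in a captured log. A toy sketch of recovering the counts, assuming the separator really is the backspace character (\x08):

```python
import re

# Toy stand-in for the captured log: progress counts separated by runs
# of backspace characters, which a browser shows as "squared question marks".
raw = "100000" + "\x08" * 15 + "200000" + "\x08" * 15 + "300000"
counts = re.split(r"[\x00-\x08\x0b-\x1f\x7f]+", raw)
print(counts)      # ['100000', '200000', '300000']
print(counts[-1])  # the last count the tool printed
```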

:confused:


Thanks for following up! I reviewed carefully and think everything is working as expected.

  1. FastQC

> uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'

Input dataset 8 – datatype/format sra_manifest.tabular
Output dataset 9 – “red” errored result

The tool is expecting fastq reads and filters for those in the input drop-down menu of potential datasets. Dataset 8 is not in that listing. I’m guessing that you dragged-and-dropped the dataset instead(?) That can lead to errors, since drag-and-drop is an “override” function that bypasses the initial format checks.

If a dataset is not in the listing for a tool’s input area (any of them), either the datatype/format needs to be modified or the content isn’t a match for what that tool can manipulate. #tool-doesnt-recognize-input-datasets

Several of the early FastQC runs in your history are similar. The input wasn’t a content match, and at least one input was in an error state.

  2. FastQC

> java.io.EOFException: Unexpected end of ZLIB input stream

Input dataset 45 – datatype/format fastqsanger.gz
Output dataset 46 – “green” (putatively) successful result

The output was purged so I can’t see all of it, just a bit from the “raw” output (none of the HTML). The input does appear to be in a fastq format. The datatype was compressed “fastqsanger.gz”. I can’t check the actual contents except for the first few lines. Those look to be a hybrid between “submitted” and “processed” SRA reads that NCBI hosts.

If you really want to investigate, this older FAQ #ncbi-sra-sourced-fastq-data has a list of all the manipulations we knew about to standardize the various read formats that NCBI hosts. Why so much variation from NCBI in the past? Changes over time in sequencing protocols, in the SRA submission process itself, and in how NCBI creates/standardizes the “processed” read data then versus now, plus probably some other historical things I don’t know or can’t remember :slight_smile:

Confusing, but it shouldn’t matter now: format issues will be avoided with the Faster tool, and you are using that now.

  3. Fastq info

This one is different: it is the output from the tool, and it happens to also appear in the job log.

> fastq_utils 0.25.1
> DEFAULT_HASHSIZE=39000001
> Scanning and indexing all reads from /corral4/main/objects/9/4/8/dataset_94835bb6-102b-4cdf-9484-a19ea90a0d6c.dat
> Read name provided with no suffix
> 10000020000030000040000050000060000070000080000090000010000001100000120000013000001400000150000016000001700000180000019000002000000210000022000002300000240000025000002600000270000028000002900000300000031000003200000330000034000003500000360000037000003800000390000040000004100000420000043000004400000450000046000004700000480000049000005000000510000052000005300000540000055000005600000570000058000005900000600000061000006200000630000064000006500000660000067000006800000690000070000007100000720000073000007400000750000076000007700000780000079000008000000810000082000008300000840000085000008600000870000088000008900000900000091000009200000930000094000009500000960000097000009800000990000010000000101000001020000010300000104000001050000010600000107000001080000010900000110000001110000011200000113000001140000011500000116000001170000011800000119000001200000012100000122000001230000012400000125000001260000012700000128000001290000013000000131000001320000013300000134000001350000013600000137000001380000013900000140000001410000014200000143000001440000014500000146000001470000014800000149000001500000015100000152000001530000015400000155000001560000015700000158000001590000016000000161000001620000016300000164000001650000016600000167000001680000016900000170000001710000017200000173000001740000017500000176000001770000017800000
> ERROR: Error in file /corral4/main/objects/9/4/8/dataset_94835bb6-102b-4cdf-9484-a19ea90a0d6c.dat: line 71398384: file truncated

Input dataset 32 – datatype/format fastqsanger.gz
Output dataset 253 – “green” successful result

How to interpret this: the Fastq info tool ran successfully (it didn’t fail itself), and it reported back a problem it found in the input. This test history I created a few months ago has an example that matches your use case (artificially created). I’ll leave this shared as a reference. Galaxy

How to catch these: these are text files, so they are simple to parse/filter/combine. Even if all the data is in a collection, you’ll be able to tell which datasets any errors came from. I just did this two different ways in that shared history as examples, and there are likely 100+ ways to do it. Anything “text manipulation” you can do on the command line can usually be done in Galaxy, with the bonus of being able to add it to a workflow; see data-manipulation-olympics/tutorial.html#cheatsheet
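As a local illustration of the same idea (in Galaxy the text-manipulation tools play this role; the folder name below is made up):

```python
from pathlib import Path

# Scan every Fastq info report in a folder and note which datasets
# contain an ERROR line: the same idea as a Select/Filter step in Galaxy.
bad = {
    report.name: line.strip()
    for report in Path("fastq_info_reports").glob("*.txt")
    for line in report.read_text(errors="replace").splitlines()
    if line.startswith("ERROR:")
}
for name, error in bad.items():
    print(f"{name}: {error}")
```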

Thanks again and hope that helps!

  1. FastQC

Yes, I did drag and drop the datasets I downloaded from the NCBI SRA database to upload them onto Galaxy. So maybe that caused or contributed to the errors.

> The output was purged so I can’t see all of it, just a bit from the “raw” output (none of the HTML). The input does appear to be in a fastq format. The datatype was compressed “fastqsanger.gz”. I can’t check the actual contents except for the first few lines. Those look to be a hybrid between “submitted” and “processed” SRA reads that NCBI hosts.

My apologies. I deleted some files for fear I would run out of memory. So basically, the files I had run through FastQC earlier were in a mish-mash format that could not be read by downstream tools, is that correct?

I have now imported all the files using Faster Download and Extract Reads in FASTQ. That seemed to work; I now have outputs for forward and reverse reads instead of single files. So I re-ran the FastQC analyses and the FASTQ info validation in parallel.

  3. FASTQ info

None of the files I was able to run were truncated. However, they still seem to have the same issue, “Read name provided with no suffix”, and produce those number/character lists (similar to the entry “6 FASTQ info on data 2 and data 1: Validation” in the history you shared).

Are these the expected outcomes or do they also suggest that I will have downstream issues (e.g. when trimming or aligning the reads)? If so, can I somehow bypass or sort that out?

What I am concerned about is that all the “FastQC” outputs had this message for “Tool standard error”:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/corral4/main/jobs/046/912/46912172/_job_tmp -Xmx28g -Xms256m

Is that something I should be concerned about?

Thank you so much for such extensive help - I really appreciate this :blush: