UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Hi all,

I’m running Galaxy on a local on-premises server and developing a tool using Rscript. My tool works fine up to the point of Rscript execution, but it finally returns “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 1: invalid start byte”. What are the possible causes?

  • Galaxy ver.: v20.01
  • Python ver.: 3.6.8
  • OS: CentOS 8
  • output file format: txt
galaxy.model.metadata DEBUG 2023-05-25 15:47:41,938 [p:1938833,w:0,m:1] [DRMAARunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 502
galaxy.jobs.runners ERROR 2023-05-25 15:47:41,944 [p:1938833,w:0,m:1] [DRMAARunner.work_thread-3] (303/546) Job wrapper finish method failed
Traceback (most recent call last):
  File "lib/galaxy/jobs/__init__.py", line 1522, in _finish_dataset
    dataset.set_peek(line_count=line_count)
TypeError: set_peek() got an unexpected keyword argument 'line_count'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "lib/galaxy/jobs/__init__.py", line 1657, in finish
    output_name, dataset, job, context, final_job_state, remote_metadata_directory
  File "lib/galaxy/jobs/__init__.py", line 1525, in _finish_dataset
    dataset.set_peek()
  File "lib/galaxy/model/__init__.py", line 2621, in set_peek
    return self.datatype.set_peek(self)
  File "lib/galaxy/datatypes/data.py", line 903, in set_peek
    est_lines = self.estimate_file_lines(dataset)
  File "lib/galaxy/datatypes/data.py", line 854, in estimate_file_lines
    dataset_read = dataset_fh.read(sample_size)
  File "/home/galaxy/galaxy/.venv/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Thanks,
Yukie

Hi @yukieymd

Is the format of the data a defined datatype? If not, maybe try defining the datatype. Or add datatypes that are already defined in newer releases but might not be included in the older release you are working with.

Example search with “set_peek”. Search — Galaxy Project 20.01 documentation

You are working with an older release of Galaxy, from a time when Python support for versions 2 and 3 was still mixed. So when you search the docs, navigate by release.

Other “reasons” I can think of:

  1. Is the dataset (file) name something that wouldn’t play well with Rscript? It is better to reference datasets than to use the original file name. We only have a handful of tools that require the original file name, and those are difficult for people to use. Why? Because the name is a value users can modify, and it depends on what other tools choose as their default naming (spaces, dots, other odd characters). So, if that situation can be avoided, it is generally a “better” strategy.

  2. I suppose it could also be something based on the file content. Unexpected compression could be involved: binary vs plain text. But this loops back to defined datatypes. If your tool is not screening for appropriately assigned input datatypes, or does not know how to check and handle the kind of data it is working with (an extra uncompress step, etc.), consider adding that in. And remember, some binary data is not appropriate for a “dataset peek view”. Example: bam.

  3. A web search shows some solutions in various contexts. And, it came up a few times at our older forum. Some examples:

  1. Planemo for tool development: Welcome to Planemo’s documentation! — Planemo 0.75.11 documentation

  2. You can also ask for help at the tool development chat here. Updating your development environment might be recommended. You're invited to talk on Matrix
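
One observation related to point 2 above: gzip files always start with the two bytes 0x1f 0x8b, so a 0x8b at position 1 is consistent with a gzip-compressed file being read as plain text. That is only one possible explanation, not a confirmed diagnosis. Below is a minimal Python sketch for checking a file from the command line. The helper names (`looks_gzipped`, `try_read_as_text`) and the temporary demo files are hypothetical, made up for illustration, and are not part of Galaxy.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def try_read_as_text(path, sample_size=1024):
    """Read a sample as UTF-8 text; return the error message, or None on success."""
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read(sample_size)
        return None
    except UnicodeDecodeError as exc:
        return str(exc)


# Hypothetical stand-ins for a tool output: one plain text file and one
# gzip-compressed file that carries a text-like name.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.txt")
gz_path = os.path.join(workdir, "compressed.txt")

with open(plain_path, "w", encoding="utf-8") as fh:
    fh.write("gene\tcount\nA\t1\n")
with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    fh.write("gene\tcount\nA\t1\n")

for path in (plain_path, gz_path):
    print(os.path.basename(path), looks_gzipped(path), try_read_as_text(path))
```

To check a real output, call `looks_gzipped()` on the dataset file your job produced. If it returns True, the script may be writing compressed output while the tool XML declares the output as txt.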


Others can comment, these are just my guesses :slight_smile:.

Hi @jennaj,

Thank you very much for all the helpful information. Regarding the datatype defined as “txt” in the output section of the tool XML file: to be honest, I’m not entirely sure about it, since I’m not the author of the Rscript. I think it would be a good idea to reach out to the author of the Rscript and ask about it.

Best,
Yukie
