I’m running Galaxy on a local on-premises server and developing a tool that uses Rscript. My tool works fine up to the point of Rscript execution, but finally returns “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x8b in position 1: invalid start byte”. What are the possible causes?
Galaxy ver.: v20.01
Python ver.: 3.6.8
OS: CentOS 8
output file format: txt
galaxy.model.metadata DEBUG 2023-05-25 15:47:41,938 [p:1938833,w:0,m:1] [DRMAARunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 502
galaxy.jobs.runners ERROR 2023-05-25 15:47:41,944 [p:1938833,w:0,m:1] [DRMAARunner.work_thread-3] (303/546) Job wrapper finish method failed
Traceback (most recent call last):
File "lib/galaxy/jobs/__init__.py", line 1522, in _finish_dataset
dataset.set_peek(line_count=line_count)
TypeError: set_peek() got an unexpected keyword argument 'line_count'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
File "lib/galaxy/jobs/__init__.py", line 1657, in finish
output_name, dataset, job, context, final_job_state, remote_metadata_directory
File "lib/galaxy/jobs/__init__.py", line 1525, in _finish_dataset
dataset.set_peek()
File "lib/galaxy/model/__init__.py", line 2621, in set_peek
return self.datatype.set_peek(self)
File "lib/galaxy/datatypes/data.py", line 903, in set_peek
est_lines = self.estimate_file_lines(dataset)
File "lib/galaxy/datatypes/data.py", line 854, in estimate_file_lines
dataset_read = dataset_fh.read(sample_size)
File "/home/galaxy/galaxy/.venv/lib64/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Is the format of the data a defined datatype? If not, maybe try defining the datatype? Or add datatypes that are already defined upstream but might not be in the older release you are working in.
You are working on an older release of Galaxy, from a time when Python 2 and Python 3 support was still mixed. So when you search the docs, navigate by release.
Other “reasons” I can think of:
Is the dataset (file) name something that wouldn’t play well with Rscript? It is better to reference datasets than to use the original file name. We only have a handful of tools that require the original file name, and those are difficult for people to use. Why? Because the name is a value users can modify, and it depends on what other tools choose as their default naming (spaces, dots, other odd characters). So, if that situation can be avoided, it is generally a “better” strategy.
I suppose it could also be something based on the file content. Unexpected compression could be involved: binary vs plain text. But this loops back to defined datatypes. If your tool does not screen its inputs for appropriate assigned datatypes, or does not know how to check and handle which kind it is working with (an extra uncompress step, etc.), consider adding that in. And remember, some binary data is not appropriate for a “dataset peek view”. Example: bam.
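On the compression idea: byte 0x8b at position 1 would fit a gzip file. A gzip stream starts with the two bytes 0x1f 0x8b, and 0x1f is valid ASCII, so a UTF-8 decoder gets past the first byte and fails on the second. That is only a guess about your output file; the log does not confirm it. A minimal Python sketch for checking, where the names is_gzip and read_text are made up for illustration and are not part of Galaxy:

```python
import gzip

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def read_text(path, encoding="utf-8"):
    """Read a file as text, decompressing first if it looks gzip-compressed.

    Hypothetical helper: it only handles gzip, not other compression formats.
    """
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rt", encoding=encoding) as fh:
        return fh.read()
```

If is_gzip returns True for the file the Rscript writes, then the "txt" output is really compressed data, and either the script should write plain text or the tool should declare a datatype that matches.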
A web search shows some solutions in various contexts, and the error came up a few times at our older forum.
You can also ask for help at the tool development chat here. Updating your development environment might be recommended.
Thank you very much for providing various helpful information. Regarding the datatype defined as “txt” in the output section of the tool XML file, to be honest, I’m not entirely sure about it since I’m not the author of the Rscript. I think it would be a good idea to reach out to the author of the Rscript and inquire about it.