Unicycler failed - "reads unavailable" -- Inputs not recognized due to an incompatible datatype assignment

Hi. This is my first time using Galaxy and I got stuck straight away, so it might be a basic mistake on my part, but I have been following step-by-step tutorials and can’t figure out what is wrong. I could not find any help in previous messages.
I have 2 kinds of errors when trying to run an assembly with Unicycler.
First, I have an issue when I try to select the reads I want to use. When I scroll down, I do not see all of my reads, even though they are loaded properly. When I use the “search” instead and choose the right read, it turns from green to white and “unavailable” appears before the name of the read, as in the screenshot below:

I tried to use the very few reads that appear in the scroll down menu and it’s working! But I can’t access all the reads (anything before entry “52” is unavailable, which is most of my data).

When I try to run an assembly anyway with the reads saying “unavailable”, I get the error message:
tool error
An error occurred with this dataset:
Failed to communicate with remote job server.

Can anybody help?
Should I reload all my reads data?
Thanks
Vanina

Hi @vanina.guernier

I suspect that the assigned datatype for the fastq data is just fastq.gz, assigned by you in the Upload tool. Expand one of the “unavailable” datasets to check.

If the fastq data really is in compressed fastqsanger format, the datatype fastqsanger.gz should be assigned. The datatype fastq.gz is not specific enough. Tools will also fail if the quality scores are not actually encoded as Sanger Phred+33 (fastqsanger), or if the data is labeled as compressed (with “.gz” added) when it is really uncompressed, so you need to check (tool: FastQC). It is good practice to run FastQC on reads as the initial analysis step to confirm format and content, unless you did QA/QC prior to loading the data into Galaxy or a tutorial states that the data is in that format.
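As a rough local cross-check of the two points above (is the file really gzip-compressed, and do the quality characters look like Sanger Phred+33), a minimal Python sketch could look like this. It is not what FastQC or Galaxy run internally. The function names are made up for illustration, and it assumes plain four-line FASTQ records. A quality character below ASCII 59 can only occur with a +33 offset, so that is the only case it reports as confirmed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file content starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def quality_range(path, max_records=10000):
    """Return (lowest, highest) ASCII code seen on the quality lines.

    Assumes four lines per record; only the first max_records are scanned.
    Returns (255, 0) if no quality characters were found.
    """
    opener = gzip.open if is_gzipped(path) else open
    lowest, highest = 255, 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index >= max_records * 4:
                break
            if index % 4 == 3:  # fourth line of each record holds the qualities
                codes = [ord(char) for char in line.strip()]
                if codes:
                    lowest = min(lowest, min(codes))
                    highest = max(highest, max(codes))
    return lowest, highest


def guess_encoding(lowest, highest):
    """Classify the observed range; only Phred+33 can be positively confirmed."""
    if lowest > highest:
        return "no-reads"
    if lowest < 59:
        return "phred33"
    return "not-confirmed-phred33"
```

For example, `guess_encoding(*quality_range("reads.fastq.gz"))` (with a hypothetical file name) would return "phred33" for typical Sanger / Illumina 1.9 data. Anything else is a reason to look at the FastQC report more closely before assigning fastqsanger or fastqsanger.gz.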

If you want Galaxy to do the datatype checks/assignments when using the Upload tool, do not directly assign a datatype but use the “autodetect datatype” option (the default). Compressed fastq data will be uncompressed with this method and, if the quality score encoding matches, will be assigned the datatype fastqsanger (no “.gz” at the end). If you get a different datatype, then you’ll need to do some manipulations.

Unicycler is a tool that uses uncompressed fastqsanger anyway. If you input compressed fastqsanger.gz, Galaxy will create a hidden uncompressed version of the data to submit to this tool. Most tools accept compressed fastqsanger.gz data, but this one does not, and you’ll save quota space in your account by not duplicating the data in a compressed + uncompressed format.

Note: The “name” of a dataset has nothing to do with the datatype. The name is just a label. The assigned datatype is the metadata tools use to ensure that inputs are in an appropriate format.
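To make the name-versus-datatype point concrete, here is a toy Python model of how a tool form might decide which datasets are selectable. It is only an illustration, not Galaxy's actual code. The parent links in `PARENT`, the `UNICYCLER_LIKE_FORMATS` list, and the function names are assumptions chosen to mirror what is described in this thread (fastqsanger.gz is usable, directly or via an implicit conversion, while generic fastq.gz is not).

```python
# Hypothetical "is a more specific kind of" links between datatypes.
PARENT = {
    "fastqsanger": "fastq",
    "fastqsanger.gz": "fastq.gz",
}

# Hypothetical list of formats a Unicycler-like tool would offer in its
# input select box.
UNICYCLER_LIKE_FORMATS = ["fastqsanger", "fastqsanger.gz"]


def is_a(datatype, wanted):
    """True if datatype equals wanted or is a more specific kind of it."""
    current = datatype
    while current is not None:
        if current == wanted:
            return True
        current = PARENT.get(current)
    return False


def selectable(dataset, accepted_formats):
    """Decide from the assigned datatype only; the dataset name is ignored."""
    return any(is_a(dataset["datatype"], fmt) for fmt in accepted_formats)


entry_44 = {"name": "sample_R1.fastq.gz", "datatype": "fastq.gz"}
entry_55 = {"name": "sample_R1.fastq.gz", "datatype": "fastqsanger.gz"}

print(selectable(entry_44, UNICYCLER_LIKE_FORMATS))  # generic fastq.gz
print(selectable(entry_55, UNICYCLER_LIKE_FORMATS))  # specific fastqsanger.gz
```

Both toy datasets carry the same name, yet only the one whose assigned datatype is the specific fastqsanger.gz passes the check. A generic fastq.gz is not treated as fastqsanger.gz because the general type does not promise Phred+33 quality scores.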

These FAQs explain with more details, including how to check if your data really does have fastqsanger (Sanger Phred+33) scaled quality scores: https://galaxyproject.org/support/

Please review and see if this helps!

Hi @jennaj
Thanks for the information. I checked on the format for reads that worked or not:

  • for 55 (which worked) “uploaded fastqsanger.gz file”
  • for 44 (which did not work) “uploaded fastq.gz file”

So one is fastq and one is fastqsanger, but both seem to be compressed, as per the .gz.
My understanding is that on Galaxy, an uncompressed file was created for entry 55 from the fastqsanger.gz file, but that this could not be done from a fastq.gz file?

I’m not sure I understand your comment on FastQC. I actually ran FastQC on all my data, as well as MultiQC, which seemed to show good results, and then I ran Trimmomatic on all reads. The trimming also seems to have worked, and I first tried to run Unicycler using the trimmed reads instead of the raw data, but this did not work.

Summary results from FastQC for 44 is:
##FastQC 0.11.8

Basic Statistics pass
#Measure Value
Filename 6461x11601MD-SKQ_S1_L001_R1_001_fastq_gz.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 856011
Sequences flagged as poor quality 0
Sequence length 35-301
%GC 31

END_MODULE

So I do not see any problem here? How does this relate to my problem?

So if I summarize, your advice is to unzip the .gz files before uploading them in order to save some space? And then, when uploading them, use the “auto” for the file format attribution?

One last thing: I should probably delete everything and upload everything again so that I start clean. If I click on the little cross on the right panel in the history, will that actually clear the space in my account? I would not want to have data in the background still using some memory.

Thanks for your help
Vanina

Your data is in fastqsanger.gz format. There is no need to reload; just change the datatype for these datasets. Do this for each dataset by clicking on the pencil icon to reach the Edit Attributes forms, then use the “Datatypes” tab. These FAQs explain why the tool is not recognizing the fastq.gz inputs and how to adjust metadata:

Next time, you can set the datatype when loading the data in the Upload tool to avoid the extra steps.
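When many datasets in a history carry the wrong label, the per-dataset pencil-icon edit can also be scripted through the Galaxy API. The sketch below assumes the BioBlend Python library and its `histories.show_history(..., contents=True)` and `histories.update_dataset(..., datatype=...)` calls. The helper name `relabel_datasets`, the server URL, the API key, and the history ID are placeholders, and the `extension` key is assumed to be present in each history content entry. Treat it as a starting point to adapt, not a tested recipe for any particular server.

```python
def relabel_datasets(gi, history_id, old_type="fastq.gz", new_type="fastqsanger.gz"):
    """Change the datatype of every active dataset labeled old_type.

    gi is expected to be a bioblend.galaxy.GalaxyInstance (or any object
    exposing the same histories.show_history / histories.update_dataset
    methods). Returns the names of the datasets that were relabeled.
    """
    changed = []
    for item in gi.histories.show_history(history_id, contents=True):
        if item.get("history_content_type", "dataset") != "dataset":
            continue  # skip collections
        if item.get("deleted"):
            continue  # skip deleted datasets
        if item.get("extension") != old_type:
            continue  # already has a different datatype
        # Only the metadata label changes; the file content is untouched.
        gi.histories.update_dataset(history_id, item["id"], datatype=new_type)
        changed.append(item["name"])
    return changed


# Hypothetical usage (requires `pip install bioblend` and a personal API key):
# from bioblend.galaxy import GalaxyInstance
# gi = GalaxyInstance(url="https://your.galaxy.server", key="YOUR_API_KEY")
# print(relabel_datasets(gi, "YOUR_HISTORY_ID"))
```

Relabeling is only appropriate once the content has been checked (for example, FastQC reporting “Sanger / Illumina 1.9” encoding), since the new datatype is a claim about the data and not a conversion of it.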

In most cases, you’ll want to preserve the compressed format.

You could also reload the data and use “autodetect” for the datatype. It will uncompress.

Either option is fine; it depends on which tools you plan to use and how large your data is. Since you are running tutorials, the data is small. Later on, when you work with large data or start to fill up your account quota, how you load data for particular tools will be something to consider, to avoid duplicating data and using up quota space for no practical purpose.

Example: If you are starting an analysis that will go through QA steps then mapping, using compressed fastq data all the way through would be fine, and save space.

Tools state the exact expected input formats when there are no datasets in the history that meet those criteria. So you might want to keep an extra empty history around for that purpose (switch to that history, load the tool form, and review the expected datatypes for each input “select” option).

Two examples with different input requirements: Unicycler and Hisat2 (tool form screenshots not shown).

Clicking on the X icon deletes datasets, but they remain recoverable. The same is true when deleting entire histories. There is an extra step to permanently delete (purge) data (datasets or histories) so that it no longer counts toward the quota. See these FAQs for the how-to:

Thanks. It seems to be working for most of my data but not all.
For some of them when I try to edit I have this message:

Edit dataset attributes
Attributes updated, but metadata could not be changed because this dataset is currently being used as input or output. You must cancel or wait for these jobs to complete before changing metadata.

The odd thing is I do not have any job running at the moment…

In the end I managed to edit them all, and it seems that the assemblies are up and running.
Thanks for helping @jennaj

@vanina.guernier

Glad things are working now!