SPAdes/Unicycler assembly failures

assembly
unicycler
spades

#1

Hi, this is my first time reporting an issue, so I apologize if I overlook info, can’t format code correctly in this flavor of markdown, etc. I’ve submitted a bug report via the Galaxy server, but, at the risk of redundancy, wanted to ask for help here to try to get the issue resolved sooner in case someone knows what might be going wrong. I can’t find a previous mention of this particular problem in GalaxyHelp or via Google.

I’m trying to assemble a genome on the Galaxy main server (“version_major”: “18.09”), and have trimmed my reads (SRR5906897, Trimmomatic headcrop 12, slidingwindow 4, 25, minlen 100). When I attempt a de novo assembly via Unicycler, however, it consistently fails with the message:

Error: SPAdes crashed! Please view spades.log for more information.

After issuing the bug report, I received an email with the additional stdout code that appears to have the main issues in the following lines:


libgomp: Thread creation failed: Resource temporarily unavailable

== Error == system call for: "['/pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/spades-3.12.0-1/bin/spades-hammer', '/pylon5/mc48nsp/xcgalaxy/main/staging/21785278/working/spades_assembly/read_correction/corrected/configs/config.info']" finished abnormally, err code: 1

I’m uncertain whether the crucial issue is the “Resource temporarily unavailable” line or whether spades-hammer can’t access the config.info file indicated. I’ve tried re-running the job with " Skip SPAdes error correction step" set to “Yes”, but I appear to receive the same error. I’m assuming this is a legitimate bug that will be hard to resolve on my end, but would appreciate any received wisdom as to the best course of action. Can anyone advise?

Thanks!


#2

Hi, Thanks for reporting the problem. I was looking at your history and the inputs are Ok. Am running a quick test job to see if the problem is server side or cluster, plus looking at the input closer just to rule out format/content issues.

Feedback once done


#3

Unicycler is working fine on test data. For your failure, it was due to the input reads not being paired up.

Make sure to only input matched forward/reverse reads. The tools Fastq Interlacer and De-interlacer can be used to filter out reads that are not paired. Then, try another rerun.

Galaxy tutorials for Assembly: https://galaxyproject.github.io/training-material/topics/assembly/


#4

Hmm… Thanks for looking into it, but I don’t think that’s the problem. I hadn’t bothered to check that there weren’t any unpaired reads in the inputs, since they were output by Trimmomatic as the relevant paired read files, but when I put it through the Fastq Interlacer today, the tool outputs the message:

There were 0 single reads.
Interlaced 471127 pairs of sequences.

I then put the reads through the De-Interlacer, and submitted those read files to Unicycler, and it outputs the same error for me. So I can’t account for the difference from the results generated with your test data, but I’m pretty sure it’s not because of the presence of unpaired reads.


#5

The fastq datasets were of different sizes with a differing number of reads (originally). The inputs now have the same number of reads.

However, the new run had the /2 reads entered for forward and the /1 reads entered for reverse on the tool form. The ordering should be swapped. I am rerunning your job that way to see what results. You could also try this now - I am expecting it to work.

I am also running some other tests. This tool was updated and I want to make sure that:

  • compressed fastqsanger.gz inputs are not a problem (shouldn’t be, or that’s something to fix)
  • the sequence identifiers containing a space is not a problem (again, shouldn’t be)

I’ll write back once all the tests complete.


#6

This did work. So, please also try reversing the order of the reads in your rerun job.

The other items I am still testing but given that result do not expect any problems (the first test rules out the other issues).


#7

Ok, I’m running the Unicycler again with the interlaced-then-de-interlaced reads in the reverse order now, and will report back if it doesn’t work (these runs have usually taken a while to error out, in my experience). However, I still have some questions:

  1. I see now that you’re right that the de-interlaced reads were in reverse order: the output file that received the automatically-generated name “FASTQ de-interlacer left mates…” contains reads whose IDs all end with ‘/2’, indicating they were the reverse reads from the original set. However, when I ran the FASTQ interlacer tool in the first place, I made sure to pass the “…897_1.fastq.gz” to “Left-hand mates”, and “…897_2.fastq.gz” to “Right-hand mates”. If the FASTQ de-interlacer tool is going to automatically append names to the output as “left mate” and “right mate”, it’s confusing that it doesn’t maintain the same order of those names as was passed in arguments to the interlacer tool. Unless I made a mistake somewhere in there, it seems to be unfortunate behavior on the part of the tool.
  2. I still don’t get different numbers of reads in the output from Trimmomatic; can you tell me how you’re detecting that? Whether I check by loading the files into BioPython and then checking the number of records, or if I use the more crude measure of wc -l from Bash, I count the same number of forward and reverse reads, whether comparing them among the untrimmed files, as well as for the paired outputs from Trimmomatic. I definitely see that the two files are of different sizes, but you seem to be suggesting that the problem SPAdes is encountering is that the number of reads passed is different between the two, and I can’t confirm that on my end.
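
For anyone who wants to repeat this kind of check outside Galaxy, here is a minimal sketch using only the Python standard library. It counts the records in each file and compares mate identifiers position by position. The function names are hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines) with an optional trailing /1 or /2 on each identifier.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz compression by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_ids(path):
    """Yield the identifier (without the leading '@') of each 4-line record."""
    with open_fastq(path) as handle:
        # Skip blank lines so a trailing empty line is not counted as a record.
        non_blank = (line for line in handle if line.strip())
        for index, line in enumerate(non_blank):
            if index % 4 == 0:
                yield line.rstrip("\n")[1:]


def strip_mate_suffix(read_id):
    """Drop a trailing /1 or /2 so forward and reverse IDs can be compared."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def check_pairing(forward_path, reverse_path):
    """Return (n_forward, n_reverse, n_mismatched) for two FASTQ files."""
    n_forward = n_reverse = n_mismatched = 0
    pairs = zip_longest(read_ids(forward_path), read_ids(reverse_path))
    for fwd, rev in pairs:
        if fwd is not None:
            n_forward += 1
        if rev is not None:
            n_reverse += 1
        # A missing mate or a differing base identifier counts as a mismatch.
        if fwd is None or rev is None:
            n_mismatched += 1
        elif strip_mate_suffix(fwd) != strip_mate_suffix(rev):
            n_mismatched += 1
    return n_forward, n_reverse, n_mismatched
```

Equal counts with zero mismatches would suggest the two files are properly paired and in the same order; it says nothing about whether a downstream tool accepts the identifier format.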

…And in the time it took to write all of that, Unicycler has crashed again, with the same error message about SPAdes, even for the job in which I had passed “63: FASTQ de-interlacer right mates…” to “Select first set of reads”, and so on; the reverse orientation as I had tried before. So I’m still not sure what’s going wrong.


#8

Thanks for writing back. Here are the stats and I found out what the issues are around the failures. Some tools are picky about format. The testing became complex but there is a solution.

  • Matched reads must be provided in the forward and reverse inputs (only)
  • Those reads cannot contain internal white space in the header. This seems to cause problems with both the Deinterlacer and Unicycler tools.
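
A quick way to see whether a dataset is affected by the second point is to scan the header lines. The sketch below is only an illustration: `summarize_headers` is a hypothetical helper, and the example headers are an assumed shape for EBI SRA identifiers, not copied from the actual files.

```python
from collections import Counter


def summarize_headers(header_lines):
    """Count headers with internal whitespace and tally /1 vs /2 suffixes."""
    with_space = 0
    suffixes = Counter()
    for header in header_lines:
        header = header.rstrip("\n")
        # Any space or tab inside the identifier line is flagged.
        if any(ch.isspace() for ch in header):
            with_space += 1
        if header.endswith("/1"):
            suffixes["/1"] += 1
        elif header.endswith("/2"):
            suffixes["/2"] += 1
        else:
            suffixes["none"] += 1
    return with_space, dict(suffixes)


# Assumed example headers: one with a space, one already cleaned up.
example = ["@SRR5906897.1 1/1\n", "@SRR5906897.2_2/1\n"]
print(summarize_headers(example))
```

The suffix tally also shows at a glance whether a file labeled as forward reads actually holds /2 identifiers.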

Inputs/QA processing

Original read count, using FastQC:

  • SRR5906897 forward (R1): 861794
  • SRR5906897 reverse (R2): 861794

Post-Trimmomatic (run1 parameters applied) read count, using FastQC:

  • SRR5906897 forward (R1) paired: 732201
  • SRR5906897 reverse (R2) paired: 732201
  • SRR5906897 forward (R1) unpaired: 42532
  • SRR5906897 reverse (R2) unpaired: 6298

Post-Trimmomatic (run2 parameters applied) read count, using FastQC:

  • SRR5906897 forward (R1) paired: 471127 > input to failed Unicycler assembly
  • SRR5906897 reverse (R2) paired: 471127 > input to failed Unicycler assembly
  • SRR5906897 forward (R1) unpaired: 156090
  • SRR5906897 reverse (R2) unpaired: 18560

Post interlacer read count, on Trimmomatic run2, using FastQC:

  • SRR5906897 Pairs: 942254 total seqs, 471127 pairs
  • SRR5906897 Singles: 0

Post deinterlacer read count, on Trimmomatic run2 interlaced result, using FastQC:

  • SRR5906897 left mates (ended up with R2 reads): 471127 > input to failed Unicycler assembly
  • SRR5906897 right mates (ended up with R1 reads): 471127 > input to failed Unicycler assembly
  • SRR5906897 left singles (R2): 0
  • SRR5906897 right singles (R1): 0
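
The counts above can be cross-checked with a little arithmetic. This is a small sketch with hypothetical helper names: in Trimmomatic paired-end mode each surviving read lands in either the paired or the unpaired output, so paired plus unpaired cannot exceed the original count, and an interlaced file should hold exactly two sequences per pair.

```python
ORIGINAL = 861794  # reads per direction before trimming (from FastQC above)


def dropped_reads(original, paired, unpaired):
    """Reads discarded entirely by trimming for one direction."""
    dropped = original - paired - unpaired
    if dropped < 0:
        raise ValueError("paired + unpaired exceeds the original read count")
    return dropped


def interlaced_total_ok(total_seqs, pairs):
    """An interlaced file should contain exactly two sequences per pair."""
    return total_seqs == 2 * pairs


# Trimmomatic run2, the input to the failed assembly.
run2_forward_dropped = dropped_reads(ORIGINAL, 471127, 156090)
run2_reverse_dropped = dropped_reads(ORIGINAL, 471127, 18560)
print(run2_forward_dropped, run2_reverse_dropped)
print(interlaced_total_ok(942254, 471127))
```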

How to fix the data to get a successful run

Using the interlacer/deinterlacer doesn’t seem to matter; Trimmomatic took care of that part. The problem is with the spaces contained within the sequence identifiers. Both the Deinterlacer and Unicycler tools have trouble interpreting these sequences.

  1. Start with either the Trimmomatic or Deinterlacer paired fastqsanger.gz datasets. Trimmomatic/Deinterlacer paired results are identical in content.
  2. Pick which you want to use (Trimmomatic is fine). Click on the pencil icon for each of the datasets, then, under the “Convert” tab, uncompress the fastq data. The next tool needs uncompressed plain text input.
  3. Remove the space from the sequence identifiers in the uncompressed fastq data. I replaced it with an underscore, using the tool Text transformation with sed and the sed program: s/ ([0-9])/_\1/
  4. Input those final, paired, no-spaces-in-identifier fastqsanger results to Unicycler. You can input the R1/R2 reads in either order and get a successful run, and I didn’t compare those, yet would recommend putting the R1 reads first (forward) and the R2 reads second (reverse) to run the tool as this is consistent with the actual data content.
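
For data handled outside Galaxy, the substitution in step 3 can be sketched in Python. This mirrors the sed program shown there (first space followed by a digit on each line becomes an underscore); the function name and the example header are assumptions for illustration.

```python
import re

# Same pattern as the sed program: a space immediately followed by a digit.
SPACE_BEFORE_DIGIT = re.compile(r" ([0-9])")


def remove_header_space(line):
    """Replace the first ' <digit>' on a line with '_<digit>'.

    Like sed's s/// without the g flag, only the first match per line
    is changed. Sequence and quality lines contain no spaces, so in
    practice only identifier lines are affected.
    """
    return SPACE_BEFORE_DIGIT.sub(r"_\1", line, count=1)


def clean_fastq_lines(lines):
    """Apply the substitution to every line of an uncompressed FASTQ."""
    for line in lines:
        yield remove_header_space(line)


# Assumed example header, not taken from the real dataset.
print(remove_header_space("@SRR5906897.1 1/1"))
```

Applying it to every line, rather than only to lines starting with "@", keeps the behavior the same as the sed version and also fixes any repeated identifier on the "+" line.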

All other tests with different data variations/formatting failed.

None of this extra manipulation is ideal, so I’ll bring it up with the developers to see what can be done (if anything). 3rd party wrapped tools often have specific formatting expectations. Sometimes the Galaxy wrapper around the tool can adjust formats that trigger errors and sometimes not, requiring that the data be prepared to be in the correct format using other methods in upstream steps.


#9

Ok, thanks for the detailed breakdown, Jennifer. I’ll try modifying the read seq IDs and re-running on my end, and will report back if it doesn’t work.

It’s too bad that the SRR files sometimes contain spaces in those IDs while a lot of tools might not be designed to tolerate them; if a Galaxy-based wrapper script can’t be instituted to parse such situations, I might suggest elevating this kind of input-format-consideration to a part of the site’s FAQ/common issues docs so that users can be aware of it and perform such a workaround on their own; if it’s already there, I must have missed it.

I’m glad to hear I’m not crazy about thinking the inputs had equal numbers of reads present, but I am curious why your earlier trick of swapping the forward and reverse read inputs worked on your end and not mine, if the real issue ended up being the read seq ID formatting. Were you using different reads as a starting point?

Also, while I was using another tool to get around the problem with Unicycler not working, I came across another issue with another tool on Galaxy that I’d like to note; should I do that via bug report built into the Galaxy Main server, or via this forum? Are there best practices guidelines as to how we interact with your team?


#10

I had just mixed up the tests, had about 20 running! All finished up then I went through everything all over again to figure out the root problem.

Correct, this specific issue about EBI SRA identifiers with spaces and Deinterlacer/Unicycler/SPAdes usage is not covered. The Unicycler tool was just updated and revealed the problem. Deinterlacer has always been picky about formats but the output swap has never come up before either (and is very strange, I need to examine the root cause more). I’d like to explore getting all this addressed at a higher level (tool level) and not require users to do a workaround. But if that doesn’t happen, then an FAQ for EBI SRA similar to this one for NCBI SRA data would be the next step: https://galaxyproject.org/support/ncbi-sra-fastq/. If you wanted to contribute an FAQ to the hub, that would be welcomed.

I’m not sure yet how many tools might have odd behavior due to this specific formatting issue and what new formatting would resolve the problem for all or most tools (if just removing the space is enough). Any tool that interprets sequence identifier /1 and /2 content could be impacted. When I run into tool errors, I try to reduce the data down to the most common format, regardless of type, to figure out what the tool author was expecting – these types of formatting variation requirements are rarely documented in tool manuals.

Thanks for all the feedback and hope your runs do better this time!


#11

Dear Team,

I have used Ion Torrent sample data in fastq format and have also run the Groomer to get standard fastq information. I started the assembly with a single sample, but it still failed. Below is the error reported.
“libgomp: Thread creation failed: Resource temporarily unavailable”.
Please can someone help in resolving this issue.


#12

Hi Team,
I’m also running into problems using Unicycler- I get the general error ‘Remote job server indicated a problem running or monitoring this job’, and when looking at the bug report I get the notification that spades has crashed.
I am using fastqsanger inputs, and have been inputting these directly into Unicycler (using all defaults) and then running it. The problem is, this was working beautifully until around the 20th of December (roughly) and gave me some great assemblies. After that however, Unicycler has been crashing each and every time I try and use it. Even when I return to histories where Unicycler worked and gave me an assembly, run the same data under the same conditions, Unicycler crashes.
I am very, very new to using both Galaxy and Unicycler so I apologise if I am missing something very obvious- it is possible I am! I tried using the interlacer and de-interlacer as suggested in this thread, but again, I am not sure what I am doing so I could very well be doing things wrong.

Thanks for the help that has already been provided- I will try and follow it and see if it works, I just wanted to bring up that Unicycler was working great previously in case something has happened.
If anyone can provide any further help on what might be happening, I would be incredibly grateful!
Thanks!


#13

Thanks for reporting the problem again!

Unicycler is now also failing on known test cases that should be successful.

We are looking into the problem and will post back if this can be resolved quickly. If an issue ticket is created (meaning, this can’t be fixed quickly), I’ll post that link back here so everyone can track the issue/progress/resolution.


Update: Issue is now ticketed https://github.com/galaxyproject/usegalaxy-playbook/issues/185