Large discrepancies between processing identical files on Org vs EU site

Hello, I am very new to this, so forgive me if I’m missing something, but I’ve come across massive discrepancies when processing the same files with supposedly identical tools on the EU vs. Org sites.

The first issue came up when trying to map paired FASTQ files on the EU site using “Map with BWA-MEM” on default settings. These files were part of a larger dataset I got from ENA, and every other file pair in the batch had been processed without issue. This one, however, failed with an error. I repeated the run twice and still ended up with an error. I then processed the same pair with the exact same tool on the Org site, and to my surprise it completed without issue.

The second, stranger discrepancy came when I processed another pair with the same tool (on the EU site): it was marked as completed without issue, but on closer inspection the resulting BAM file was much smaller than the other completed BAM files (2 GB vs. an average of 10-25 GB). Again, I ran the same pair through the exact same tool on the Org site, and the resulting BAM file was a much more reasonable 21 GB.

What’s going on? Why are there such large discrepancies from allegedly identical tools on the same settings? How can I expect consistent QC when co-analyzing datasets processed on EU vs. Org? I’m not even sure I can rely on the EU site.

Hi @M.r.t

Sorry to hear that you have been having problems.

The tools across servers should be identical. Maybe we can work out what the differences are so you can get your results synced up? Using a workflow would be one way to do this, since all of the details would be captured in the workflow: tool versions, parameter settings, reference data. You would then control the inputs. You could even move the history with the inputs across servers, intact, run the workflow against both copies, and expect identical results.

That said, some tools use non-deterministic algorithms, so a bit of variation would be expected even between runs on the same server, not just between different cluster nodes, since timing is a variable no matter where you are working.

I would like to review some of the differences you found. You can share back the histories with the data. If you do not want to share those publicly, we can move into a private message here instead, and only add in other administrators as needed (I’ll ask first!).

This error would be interesting to review: please send the history share link for both servers, and note the datasets involved. I’ll be able to see the other details from just that.

And, this one would also be interesting to review.



How to generate a history share link is explained in the banner of this forum, and also here →

Let me know if you want to post those into a chat message instead of here. The screenshots will probably not be enough for this one since we are looking at the deeper technical details too. Thanks! :slight_smile:

Hi @jennaj . A chat message will be better. I’m not sure how to send one here


I just sent you a message to get it started. I included instructions for how to share in there too.

Find your messages at the very top right at this site, under your account icon.

Hi @M.r.t

Thanks for sharing the histories! Very helpful.

It looks like the reads from that pair didn’t load completely from the SRA at the UseGalaxy.eu server. That then led to the downstream discrepancy when mapping. Some mapping tools may trap this, but I wouldn’t count on it; always verify the read content yourself. Even when the problem is trapped, the error messages can be odd and unclear, whereas a dedicated QA tool will usually report something actionable.
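If you want a quick sanity check outside of Galaxy, paired FASTQ files should always contain exactly the same number of reads. A minimal Python sketch of that check (the function name is mine, and the ERR6068440 file names in the comment are just how the pair from this thread would typically be named):

```python
import gzip

def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (4 lines per read)."""
    with gzip.open(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path}: {n_lines} lines is not a multiple of 4; file may be truncated"
        )
    return n_lines // 4

# Hypothetical usage for a pair like this thread's accession; counts must match:
# count_fastq_reads("ERR6068440_1.fastq.gz") == count_fastq_reads("ERR6068440_2.fastq.gz")
```

A mismatch, or a line count that isn’t a multiple of 4, is a strong hint the transfer was incomplete before any mapping job is run.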

I was able to discover this by running some QA tools on the reads. Those jobs failed, so I reviewed the job details view for each one using the i-icon. On those views, I saw these messages.


Then, using the bug-icon, the error message could be interpreted by our new Galaxy AI assistant, and it interpreted this one really well! The assistant is new, so I wouldn’t expect it to be perfect for every error yet, and much depends on how the original tool author trapped error cases. Anything unclear you can still ask about here. :slight_smile: And for anything really odd that looks like a server issue, go ahead and submit the bug report so the server administrators will see it.

Text from the Galaxy chat bot, quoted directly:

Hello jenj,

The error message you’re encountering is from a tool known as FastQC, which is commonly used for quality checking of sequencing data in FASTQ format. Here’s a simplified breakdown of the error you are seeing:

Error Explanation:

  • Failed to process file ERR6068440_2_fastq_gz.gz: This indicates that the tool encountered a problem when trying to process this specific FASTQ file.
  • SequenceFormatException: This error suggests an issue with the format of the sequence data within your FASTQ file.
  • Midline ‘AATGCTGTCTCTC<TTCTTTCA9:8186)CCTTTGGTCCTGTAGGTCACAACTGTGTTTAT’ didn’t start with ‘+’ at 39603447: In a FASTQ file, each sequence entry consists of four lines:
    1. A sequence identifier line beginning with ‘@’
    2. The raw sequence letters
    3. A separator line beginning with ‘+’
    4. A quality score line

The error indicates that at byte position 39,603,447 (or somewhere in the file), a line expected to start with a ‘+’ doesn’t comply with this format, which is required as the third line of each sequence entry.

Suggested Solution:

  1. Check File Integrity:
  • Verify whether the file got corrupted during download or transfer. You might want to redownload or decompress the file again if it was compressed, then re-upload it to Galaxy.
  2. Examine the Problematic Line:
  • If you are familiar with command-line tools, you can inspect the file using a text editor or tools like less or head to navigate to the specified position and check for an unexpected character or misalignment in the file.
  3. Re-run the Tool:
  • Once you’ve verified or corrected any issues with the file, try running the FastQC tool in Galaxy again.

If these steps don’t resolve the issue, the file might require sequence-specific investigations to ensure all entries conform to the FASTQ standard. Should you need further assistance, consulting with your data provider or a colleague familiar with the sequence data format might be beneficial.

I hope this helps guide you towards resolving the issue. If you have more questions, feel free to ask!
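To illustrate the four-line structure the bot describes, here is a minimal validator sketch. It is my own toy approximation of the kind of check FastQC performs, not FastQC’s actual code, and the function name and example records are invented:

```python
def first_malformed_record(lines):
    """Return the 1-based index of the first FASTQ record that breaks the
    four-line structure ('@' header, sequence, '+' separator, qualities),
    or None if the list of lines is well formed."""
    if len(lines) % 4 != 0:
        return len(lines) // 4 + 1  # trailing partial record, e.g. a truncated file
    for rec_no in range(len(lines) // 4):
        header = lines[rec_no * 4]
        separator = lines[rec_no * 4 + 2]
        if not header.startswith("@") or not separator.startswith("+"):
            return rec_no + 1
    return None
```

A midline like the one in the error above would trip the separator check: the third line of that record starts with sequence letters instead of ‘+’, exactly what the SequenceFormatException reports.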



What to do

Whenever data is newly loaded into Galaxy, run some simple checks to make sure the transfer happened correctly. That might be as simple as inspecting the data to confirm the datatype, but it can also involve dedicated content-checking tools. FASTQ reads are also sensitive to their content, so assessing that is usually a good idea, along with manipulations like trimming.
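For gzipped uploads, even just decompressing the file to the end will catch a truncated transfer before any mapping time is wasted on it. A minimal sketch of that check in Python (the function name is mine; gzip.BadGzipFile needs Python 3.8+):

```python
import gzip

def gzip_intact(path, chunk=1 << 20):
    """Read a .gz file through to the end; a truncated or corrupted
    transfer raises EOFError or BadGzipFile, which we report as False."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk):
                pass
        return True
    except (EOFError, gzip.BadGzipFile, OSError):
        return False
```

This is roughly equivalent to running gzip -t on the command line.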

More about Upload → Getting Data into Galaxy

Then, more about QA/QC is covered in this prior topic. The workflow at the end there would likely be a good fit for you. Fetching the reads with the dedicated SRA read-fetching tool will also add some stability versus using plain URLs, since that path is handled a bit differently by the SRA data servers. It also avoids potential content loss from multiple transfer steps (for example, from a cloud resource down to your computer, then back up into another cloud resource); cloud to cloud is one less hop.



I hope this helps! We can follow up more about any questions you have, here or in the private chat. :rocket:

XRef → Search results for 'uk.ac.babraham.FastQC.Sequence.SequenceFormatException' - Galaxy Community Help

A post was split to a new topic: Human genotype analysis