Repeatexplorer fails on single end input

Has this issue been encountered before? I uploaded a fasta of pseudoreads for repeatexplorer, with file size 4.8Gb. The file was imported from FTP and it turned green. The number of sequences and the first few lines looked good. A few minutes after starting the tool, it failed with a file not found error.

Looks to me like empty single quotes after ${GALAXY_MEMORY_KB} indicate the Galaxy tool is putting an empty string for the filepath for seqclust. What can be changed so the tool accesses the input filepath?

Command line:

export GALAXY_MEMORY_KB=$((${GALAXY_MEMORY_MB:-8192}*1024)) && export PYTHONHASHSEED=0 && mkdir -p '/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files' && /repex_tarean/seqclust --cpu ${GALAXY_SLOTS:-1} --max_memory ${GALAXY_MEMORY_KB} '' --taxon 'VIRIDIPLANTAE3.0' --output_dir='/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files' --mincl '0.0001' --assembly_min '2' --keep_names '/data/dnb11/galaxy_db/files/c/2/f/dataset_c2fa7507-33cf-4412-a974-5417eaddadaa.dat' && cp '/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files/index.html' ./index.html

Tool standard error:

usage: seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}]
                [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}]
                [-c CPU] [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR]
                [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k]
                [-a {2,3,4,5}]
                [-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}]
                [-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}]
                [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                sequences

seqclust: error: argument sequences: can't open '': [Errno 2] No such file or directory: ''


Welcome @David_R

The memory not being specified in the string is Ok, it just means it will be set by the cluster environment instead (at a higher level).

Instead, the tool seems to be having trouble reading in the input file. Let’s go through the details and come up with next steps!


Screenshot

Tool form → RepeatExplorer (clustering): repeat discovery and characterization using graph-based sequence clustering (Galaxy Version 2.3.8+galaxy0), as hosted at UseGalaxy public servers


Expected inputs

  • Dataset with the fasta datatype format assigned
    • “Input file must contain FASTA-formatted NGS reads. Illumina paired-end reads are recommended.”
  • Choice made to specify whether the reads are single or paired end
    • “If paired-end reads are used, they must be interleaved and all pairs must be complete.”

How did your dataset and the options set compare to this? You can post screenshots or share your history back here if you would like help. → How to get faster help with your question

You can also do some checks yourself.

If the file is in fasta.gz format, the datatype fastagz will be assigned by the Upload tool. Then, when this tool is executed, there will be an automatic file format conversion process to uncompress the data at runtime. The uncompressed version of the data will be in your history. If this process had some problem, it usually means that something went wrong with the assigned datatype format (a mismatched label for the actual content) or that the file is compressed in a way that Galaxy cannot interpret.

However – since you used FTP I’m also curious about which server you were working at and how that was done. This was at UseGalaxy.eu, correct? Maybe there is some problem.

What to try

  • Start here → Run the tool FastqInfo on the file to check the formatting/compression.
  • Verify the single/paired content inside the file and make sure the toggle on the form is the same. Interleave the data first if needed.
  • Consider uncompressing the file yourself first (pencil icon - Edit attributes) then submitting the fasta version instead to see what happens.
  • Possibly Upload the file again and make sure to use the autodetect option to allow Galaxy to guess the datatype. → Getting Data into Galaxy.
  • In rare cases, loading up the uncompressed version of a file is needed. There are a few versions of gz out in the wild that are not compatible.
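For anyone who wants to run the compression and format check on their own machine before uploading, here is a minimal Python sketch. It is not Galaxy's datatype detection; it only looks for the standard gzip magic bytes and a leading `>` header, and the function name `sniff_fasta` is made up for this example.

```python
import gzip


def sniff_fasta(path):
    """Return (is_gzipped, looks_like_fasta) for a local file.

    Local sanity check only; this is NOT how Galaxy assigns datatypes.
    """
    with open(path, "rb") as handle:
        # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
        is_gzipped = handle.read(2) == b"\x1f\x8b"

    opener = gzip.open if is_gzipped else open
    with opener(path, "rt") as handle:
        first_line = handle.readline()

    # A FASTA file is expected to begin with a '>' header line.
    return is_gzipped, first_line.startswith(">")
```

Calling `sniff_fasta("pseudoreads.fasta")` (a placeholder filename) returns a pair of booleans. If the first is `True`, the file is gzip-compressed even if its name does not end in `.gz`, which would be one way to end up with a mismatched datatype label.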

If the input and options all check out, and a rerun still fails, then the problem could be on the server itself, maybe an issue with the tool wrapper. We can help to confirm and alert the administrators if you share your history example or even just the server URL and exact tool version so we can try to reproduce the issue. Maybe FTP at the EU server has some issue but if you are working somewhere else, that would be important to know.

So – I’ll test FTP at EU and you can do a few checks on the file itself. If it didn’t load into the history, other tools will fail too – so, starting with the Fastq Info tool will reveal that.

Hope this helps and we can follow up more! :slight_smile:

Thank you, Jennaj! I tried running FASTQ Info now and it looks like that only works for .fastq; mine is a fasta. It is on Galaxy EU, and I am in the US – looks like Galaxy EU is the only one that has repeatexplorer. Here is my history:

Also, the fasta has single reads, and repeatexplorer was set to single reads. I took the same region in 4 species, containing my gene cluster of interest, and I generated pseudoreads in R Studio, to run repeatexplorer, and hopefully get repeats that overlap more of the intergenic sequences that are conserved among the species. They aren’t real raw reads, and it seemed simpler to go with single reads for my current goals.

Now, I’m running convert fasta to tabular on my data to see if this tool can use the file.

And the FTP client I used was FileZilla, on my PC. I made the FASTA in R Studio VM on Terra and copied it to Google Cloud Storage, then downloaded it to my PC from GS and used FTP to send it to Galaxy EU. I had previously uploaded it directly and had, I think, the same ‘file not found’ error in repeatexplorer. I tried about 4 times with direct upload, deleted the file, and then used FTP twice.

fasta to tabular worked

Thanks for looking into this! I noticed the command line shows --max_memory ${GALAXY_MEMORY_KB} ''. The empty quotes, I believe, are meant to be the input file path, but they’re being treated as an argument to --max_memory instead of as a standalone positional argument. This seems like a wrapper bug where the input parameter is misplaced. Could the tool maintainers check the XML wrapper for parameter ordering?

Thank you for looking at this issue! Here are all the details in one message:

Hello, repeatexplorer on usegalaxy.eu is not working for me, and it appears to be a problem with the XML wrapper.

The command line shows --max_memory ${GALAXY_MEMORY_KB} ''. The empty quotes, I believe, are meant to be the input file path, but they’re being treated as an argument to --max_memory instead of as a standalone positional argument. This seems like a wrapper bug where the input parameter is misplaced. Could the tool maintainers check the XML wrapper for parameter ordering?

To confirm that my input fasta of pseudoreads is not corrupted, I ran fasta to tabular, which converted the file successfully. The first few lines in my fasta and the output tabular look good. I don’t think it’s a problem with the fasta.

Here is a link to my history:

Here is the command line from my attempt to run repeatexplorer:

export GALAXY_MEMORY_KB=$((${GALAXY_MEMORY_MB:-8192}*1024)) && export PYTHONHASHSEED=0 && mkdir -p '/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files' && /repex_tarean/seqclust --cpu ${GALAXY_SLOTS:-1} --max_memory ${GALAXY_MEMORY_KB} '' --taxon 'VIRIDIPLANTAE3.0' --output_dir='/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files' --mincl '0.0001' --assembly_min '2' --keep_names '/data/dnb11/galaxy_db/files/c/2/f/dataset_c2fa7507-33cf-4412-a974-5417eaddadaa.dat' && cp '/data/dnb11/galaxy_db/files/d/7/3/dataset_d7369c2f-7f43-4a5a-9dec-03dd0ee52322_files/index.html' ./index.html

Tool standard error:

usage: seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}]
                [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}]
                [-c CPU] [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR]
                [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k]
                [-a {2,3,4,5}]
                [-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}]
                [-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}]
                [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                sequences

seqclust: error: argument sequences: can't open '': [Errno 2] No such file or directory: ''

#can't open '' – file path is supposed to be there and it appears to be empty due to failure of the XML wrapper to provide the input file here

Hi @David_R

Thanks for sharing all of your details!

For the empty quotes, I ran a test. The command is using that field for the paired/not toggle. If the flag is included, paired. If not, use the default (not paired).

This is my example:

export GALAXY_MEMORY_KB=$((${GALAXY_MEMORY_MB:-8192}*1024)) && export PYTHONHASHSEED=0 && mkdir -p '/data/dnb11/galaxy_db/files/d/f/5/dataset_df5d5080-4303-4203-84c9-51988977dc21_files' && /repex_tarean/seqclust --cpu ${GALAXY_SLOTS:-1} --max_memory ${GALAXY_MEMORY_KB} '--paired' --taxon 'VIRIDIPLANTAE3.0' --output_dir='/data/dnb11/galaxy_db/files/d/f/5/dataset_df5d5080-4303-4203-84c9-51988977dc21_files' --assembly_min '5' '/data/dnb11/galaxy_db/files/7/5/3/dataset_753c7f39-c397-4c26-ab61-954cf6df93a5.dat' && cp '/data/dnb11/galaxy_db/files/d/f/5/dataset_df5d5080-4303-4203-84c9-51988977dc21_files/index.html' ./index.html

--max_memory ${GALAXY_MEMORY_KB} '--paired' --taxon 'VIRIDIPLANTAE3.0'
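To illustrate why an empty quoted string in that slot ends up reported as the `sequences` argument, here is a small Python `argparse` sketch. It is a reduced, hypothetical stand-in for the seqclust interface (the program name `seqclust_sketch` and the choice to model only `--paired`, `--max_memory` and `sequences` are assumptions for the demo, not the real seqclust source), and `reads.dat` is a placeholder path.

```python
import argparse


def build_parser():
    # Hypothetical, reduced stand-in for seqclust's command-line interface.
    # Only the pieces needed for the demonstration are modeled.
    parser = argparse.ArgumentParser(prog="seqclust_sketch")
    parser.add_argument("--paired", action="store_true")
    parser.add_argument("--max_memory", type=int)
    parser.add_argument("sequences")  # positional input path
    return parser


parser = build_parser()

# Paired run: '--paired' is recognized as a flag and the path is the positional.
paired_args = parser.parse_args(
    ["--max_memory", "8388608", "--paired", "reads.dat"]
)
print(paired_args.paired, repr(paired_args.sequences))

# Single-end run as generated: the shell passes '' on as a real, empty
# argument. It does not start with '-', so argparse treats it as a positional.
single_args, leftover = parser.parse_known_args(
    ["--max_memory", "8388608", "", "reads.dat"]
)
print(repr(single_args.sequences), leftover)
```

In the second call the empty string is taken as `sequences` and the real path is left over, which matches the "argument sequences: can't open ''" message in the tool's standard error. The value after `--max_memory` is consumed normally in both cases.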

Shared testing history using the tool test → https://usegalaxy.eu/u/jenj/h/test-repeatexplorer

So .. I am wondering if the “not paired” usage is a problem. I’ll test that next in the same history. If a problem, we can get that reported to the developers. I’ll follow up with you here either way so we can close that out as being the error trigger.

And, whoops for suggesting Fastq Info! You are totally correct. Fasta Statistics would be the simple choice for fasta data. Converting to tabular and back was a great idea to check the format. The other tool will check the content (odd base characters come to mind, since IUPAC characters might be a potential error trigger).

You could also subset your data into a smaller testing file, run that, and see if you get a different result. Perhaps the original file was “too large” for some reason and we are looking at an untrapped crash out of some type. If you did this already (maybe I missed it), just point me where to look. I was away for a few days and getting oriented with your issue again. :slight_smile: I do want to solve it!!
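If it helps, a small testing file can be cut from the original locally with a few lines of Python. This is only a sketch: the function names, the file paths in the usage note, and the default of 1000 records are arbitrary choices for illustration.

```python
def take_fasta_records(lines, max_records):
    """Yield the lines belonging to the first `max_records` FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > max_records:
                break
        if seen:  # ignore anything that appears before the first header
            yield line


def write_subset(src_path, dst_path, max_records=1000):
    """Copy the first `max_records` records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(take_fasta_records(src, max_records))
```

For example, `write_subset("pseudoreads.fasta", "pseudoreads_small.fasta", 1000)` would write the first 1000 records to a new file that can be uploaded and run with the same tool settings. The file is streamed line by line, so the multi-gigabyte original does not need to fit in memory.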

I can confirm the usage issue with single end. The job fails almost immediately. The tool tests do not cover single end usage at all so I’m wondering if this will be supported or not, or if some development work is still in progress.

I’ve opened a ticket for the developers to review. Please feel free to add more comments and to follow along! They will also probably comment back here once they see it. → Repeatexplore fails with single end reads · Issue #106 · galaxy-genome-annotation/galaxy-tools · GitHub

Since you are using constructed reads, I suppose the next step could be to develop an appropriate dataset for paired end? Anyway, I hope this works out and let's see what the developers think. :hammer_and_wrench:

@jennaj Thank you for testing and reproducing the error! It looks like it works if the box for paired-end reads is checked, and if the box is not checked, the wrapper inserts empty quotes where '--paired' would go. My current understanding is that, for single-end reads, the XML wrapper would need to eliminate that set of quotes completely so repeatexplorer defaults to single-end reads.

I converted the tabular back to fasta today and it has the same number of sequences, with the first few lines still looking good. I ran fasta stats on the original fasta as well. I understand that trying repeatexplorer on a smaller file could help confirm whether file size is an issue. However, your test used a smaller file and reproduced the same error with single-end reads.

In principle, I understand repeatexplorer is commonly used on raw reads for whole genomes, which can mean the input file size is around 70Gb. Since my constructed reads, made from 1 homologous region in 4 reference genomes, are a much smaller file than this, I am still hoping repeatexplorer can do it all at once to find repeats in the whole data set and classify which ones are more similar.

If it is not possible to run repeatexplorer on single-end reads, I can make new pseudoreads which are paired-end. However, I understand the benefit of paired-end reads is not as great when the reads are constructed. For my current analysis, repeat discovery in homologous regions of 4 genomes, I think it is simpler and preferable to use single-end pseudoreads if possible.
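The idea of dropping the empty quotes can be sketched outside of Galaxy in plain Python. This is not the actual wrapper (which is Galaxy tool XML) and not a proposed patch; the two function names and the shortened argument list are hypothetical. It only shows that quoting an empty optional value produces `''`, while filtering empty values out before quoting leaves nothing behind.

```python
import shlex


def build_command_buggy(input_path, paired=False):
    """Quote every slot, including an empty one (mirrors the observed command)."""
    paired_flag = "--paired" if paired else ""
    parts = ["seqclust", paired_flag, "--taxon", "VIRIDIPLANTAE3.0", input_path]
    return " ".join(shlex.quote(part) for part in parts)


def build_command_fixed(input_path, paired=False):
    """Drop empty optional values BEFORE quoting so no '' reaches the shell."""
    paired_flag = "--paired" if paired else ""
    parts = ["seqclust", paired_flag, "--taxon", "VIRIDIPLANTAE3.0", input_path]
    return " ".join(shlex.quote(part) for part in parts if part)


print(build_command_buggy("reads.dat"))
print(build_command_fixed("reads.dat"))
print(build_command_fixed("reads.dat", paired=True))
```

With `paired=False`, the first function emits `''` right after the program name, and a shell would hand that to the program as a real, empty argument. The second function emits no placeholder at all, so the tool would fall back to its default. Whatever form the real fix takes would need to live in the wrapper's command template, together with a tool test for the single-end case.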

Great, and yes I agree, it seems this should be able to work on single-end and even a tiny file will fail with those parameters. I see that you made some comments on the ticket which should help to get some eyes on it!

Most development work is from the community, so if you want to maybe speed this up, the developer might be accepting PRs at their repo. You could try with the changes you think will be enough (along with a new tool test). The IUC is collaborating with these – example from the prior development if this interests you (or anyone else reading along here!) → Pull requests · galaxy-genome-annotation/galaxy-tools · GitHub

I’ll see what suggested fixes I can come up with if others don’t fix it first.
