khmer: Extract partitions

I’m trying to use “Khmer: extract partitions” and it fails every time I try it. I’ve tried changing all of the input parameters, but it always fails after maybe 10 seconds. Is it possible there’s something wrong with the Galaxy version of the tool? I’ve included a screenshot for more details.

Hi @AndyCarter

Since the job is failing so quickly, I’m guessing there might be something wrong with the input file.

Try this – you are on the right job information view but we can’t see all of the details and logs. FAQ: Troubleshooting errors

  1. Review the input file data
  • Click on dataset 25 – does it have content?
  • Make sure the file is not empty
  • Then make sure the file is not just headers without data lines.
  • Since this is fasta data (yes?) there shouldn’t be any extra header lines, only > title lines and sequences.
  • You could run a tool like NormalizeFasta to standardize the format. Removing description content from fasta > title lines is pretty common, so toggle this option to Yes to do that → Truncate sequence names at first whitespace
  2. Review the input file metadata
  • Click on dataset 25 to expand it
  • What is the datatype format and the database?
  • Is the format what you expected it to be?
  • Does it match the “accepted formats” for that input section on the tool form?
  • Does the tool form Help section have any more details about how the tool works that might explain what is going on?
  • Example: if the tool is expecting to process nucleotide sequences but a protein sequence is input (both would be “fasta”), that could cause a tool to fail.
  3. Review how the file was Uploaded
  • Did you use all defaults with the Upload tool? Getting Data into Galaxy
  • Has that same file worked with other tools? The first time you use a file is usually when a problem with a file is uncovered.
  • You could run a tool like Fasta Statistics on it as a type of sanity check to review the content, and make sure it was uploaded without problems.
  • You could also try to adjust how a file is formatted.
  • Example: you might need to uncompress a file before a tool can work with it. Use the pencil-icon to reach the Edit Attributes tabs. Use the Datatype tab to convert the format.
  4. Review the Job Logs: stderr and stdout
  • Scroll down a bit more to the Job Information. FAQ: Troubleshooting errors
  • You will be interested in the Standard Output (stdout) and the Standard Error (stderr) messages.
  • If the tool found a problem or has runtime information, this is where to find it.
  • Sometimes a tool will be able to tell you exactly what is going wrong!
  • You can also copy/paste those messages and search at this forum to see if someone has asked about it before.
  • You can even paste that message into your favorite search engine to see what comes up.
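
If you would rather check a downloaded copy of the file outside of Galaxy, the step 1 checks above (not empty, not just title lines without sequence) can be roughly sketched in Python. This is only an illustration: `summarize_fasta` is a made-up name for this example, not a Galaxy or khmer function, and it assumes a plain, uncompressed fasta file.

```python
import io


def summarize_fasta(handle):
    """Count > title lines, sequence characters, and records with no sequence.

    Hypothetical helper for illustration only; assumes plain-text fasta.
    """
    n_records = 0        # number of > title lines seen
    n_bases = 0          # total sequence characters across all records
    n_empty_records = 0  # title lines that were not followed by any sequence
    current_len = None   # sequence length of the record being read, None before the first title

    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if current_len == 0:
                n_empty_records += 1  # previous record had a title but no sequence
            n_records += 1
            current_len = 0
        elif current_len is not None:
            current_len += len(line)
            n_bases += len(line)

    if current_len == 0:
        n_empty_records += 1  # last record had a title but no sequence

    return {"records": n_records, "bases": n_bases, "empty_records": n_empty_records}


if __name__ == "__main__":
    example = io.StringIO(">35\nCGCAGGCTGG\n>16\n")
    print(summarize_fasta(example))
```

A file with zero records is empty (or not fasta at all), and any count under `empty_records` means some title lines have no sequence under them.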

I’m being a bit verbose since this tool hasn’t had any questions at this forum yet! And you didn’t share your history so we can’t look at the details yet.

You are welcome to copy/paste or screenshot any of that information back here in a reply. You could also generate a history share link and paste that back here – that would let us diagnose a server side issue, too. How to generate that link is in the banner at this forum and also here directly → How to get faster help with your question

Thanks and I’ll watch for your next reply. :slight_smile: If you solved this already, it would be great if you let us know, and maybe explain a bit so the next person can read up on what worked for you, too?

Hi Jenna, thanks for all the detail. I don’t think the problem is with my input data. It’s a FASTA file I’ve used a dozen times. It shows as having 1,019,071 sequences and when I look at the file it looks like a fasta file to me. Here’s a link to that data: https://usegalaxy.org/api/datasets/f9cad7b01a472135c9420114d0aaf806/display?to_ext=fasta

Just to be sure, I tried the tool with a smaller fasta file that I’ve also used dozens of times. It failed in about 10 seconds too. Here’s the link to it: https://usegalaxy.org/api/datasets/f9cad7b01a4721351de8655f4bac6011/display?to_ext=fasta

Here’s the link to the failed runs:

https://usegalaxy.org/api/datasets/f9cad7b01a4721351b0de3c12650e6dc/display?to_ext=txt

https://usegalaxy.org/api/datasets/f9cad7b01a4721359fea6458bb079416/display?to_ext=txt

I’ve had it fail on other data too, but it’s just more of the same. Thanks for your help with this.

Great, thanks for sharing. I’m checking to see what is going on. I appreciate the examples, and more soon! :slight_smile:

If there’s anything else you need to get to the bottom of this, let me know. I’m happy to share.


Hi @AndyCarter, OK, I see the problem. Your fasta files do not have the “partition” information on the > lines yet.

You can use this tool to generate those → khmer: Sequence partition all-in-one or run the preparation tools individually. The size of the input can matter according to the Help, so consider that with your files. I’m not sure if “size” means the number of sequences or the length of the sequences … but you could test that and the publication probably explains more.

You have sequences like this – these are the tool test sequences, not yours, so the important part is the fasta formatting not the nucleotides or identifiers.

>35
CGCAGGCTGGATTCTAGAGGCAGAGGTGAGCTATAAGATATTGCATACGTTGAGCCAGC
>16
CGGAAGCCCAATGAGTTGTCAGAGTCACCTCCACCCCGGGCCCTGTTAGCTACGTCCGT
>46
GGTCGTGTTGGGTTAACAAAGGATCCCTGACTCGATCCAGCTGGGTAGGGTAACTATGT
>40
GGCTGAAGGAGCGGGCGTACGTGTTTACGGCATGATGGCCGGTGATTATGGGGGACGGG
>33
GCAGCGGCTTTGAATGCCGAATATATAACAGCGACGGGGTTCAATAAGCTGCACATGCG

But the tool is looking for a “partition” group label. That is annotated in the description content on the title lines. (“description” is anything after the first whitespace on the > lines).

>35	2
CGCAGGCTGGATTCTAGAGGCAGAGGTGAGCTATAAGATATTGCATACGTTGAGCCAGC
>16	2
CGGAAGCCCAATGAGTTGTCAGAGTCACCTCCACCCCGGGCCCTGTTAGCTACGTCCGT
>46	2
GGTCGTGTTGGGTTAACAAAGGATCCCTGACTCGATCCAGCTGGGTAGGGTAACTATGT
>40	2
GGCTGAAGGAGCGGGCGTACGTGTTTACGGCATGATGGCCGGTGATTATGGGGGACGGG
>33	2
GCAGCGGCTTTGAATGCCGAATATATAACAGCGACGGGGTTCAATAAGCTGCACATGCG
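
A quick way to see whether a downloaded file already carries these labels is to look at the > title lines directly. The Python below is a rough sketch, assuming the partition label is the last tab-separated field on the title line, as in the example above. `partition_labels` is a hypothetical helper written for this post, not a khmer or Galaxy function.

```python
import io
from collections import Counter


def partition_labels(handle):
    """Return (labelled, unlabelled, counts) for the > title lines in fasta text.

    Assumption for illustration: a title line is "labelled" when it contains a tab
    and the text after the last tab is non-empty.
    """
    labelled = 0
    unlabelled = 0
    counts = Counter()  # how many sequences carry each label

    for line in handle:
        if not line.startswith(">"):
            continue  # only title lines matter here
        title = line[1:].rstrip("\r\n")
        _name, sep, label = title.rpartition("\t")
        label = label.strip()
        if sep and label:
            labelled += 1
            counts[label] += 1
        else:
            unlabelled += 1

    return labelled, unlabelled, counts


if __name__ == "__main__":
    example = io.StringIO(">35\t2\nCGCAGG\n>16\nCGGAAG\n")
    print(partition_labels(example))
```

If every title line comes back as unlabelled, the file most likely still needs a partitioning step before the extract step can group anything.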

I’ve run the test at UseGalaxy.eu and UseGalaxy.org – all seems to be working as expected. These are my shared histories (below) if you want to take a closer look. If you want to get more examples, you can go into the Options → See in ToolShed link to reach the development repository. Since I looked that up already, you can also just go here → tools-iuc/tools/khmer at main · galaxyproject/tools-iuc · GitHub

Tool wrapper development repositories are usually organized the same way no matter who wrote them: the /test-data directory will have example data, and the tool .xml will have tool tests using that data toward the end. The publication is usually another source that will have more explanations about what is going on, along with the scientific context.

Shared data – I’ll leave this here as a reference :scientist:

More about fasta format → Datatypes - Galaxy Community Hub

Hope this helps!

Thank you Jenna!
