Question about interproscan

Hello, I am having difficulties with interproscan. I am using the FASTA output from Getorf as the input and get the error below. I have tried editing the input to simplify the headers, alter the line width, etc., but with no success. Please help with the input file format and the maximum number of proteins in one file.

Many thanks

Mark

Welcome @Mark_C

We can probably help with interproscan questions! Would you like to share the history with the error? This will allow us to review the details and make suggestions. Thanks! :slight_smile:

Dear Jennifer

I have purged a lot of the intermediate files but the outline is

RNAseqfile → Trimmomatic → RNAspades → getorf → interproscan

Happy to share the history but not sure how

Thanks

Mark

Hi @Mark_C

Thanks for sharing details! You can share what you have and we can try to determine what is going on. The instructions were linked inside the other post, but that has a lot of information, so here is just the part about how to generate the link.

:graduation_cap: Share via link

  • Open the History Options menu at the top of your history panel and select “Share or Publish”
    • Make History accessible
    • A Share Link will appear that you give to others

Where to find the menu. The first option is a toggle to generate a link. You can copy and paste that back here in your reply. :slight_smile:

Hi

Sorry for the slow reply, I’ve been ill for the past few days

I tried to post a link to the history here, but it tells me I can’t.

Many thanks

Mark

Hi @Mark_C

Please try again now! :slight_smile:

Hi @Mark_C

It looks like you are deleting jobs before they have a chance to run, but I can’t really see the details with the current sharing permissions. The single failure I can see is about Trinity, and yes, your work may be too large or, more likely, you need to do more QA. This is OK; at the end I’m going to recommend a tutorial that I think will help with the protocol and with tool choices (RNA-seq data is handled differently than DNA).

To learn more about how the job queues work, see topic tagged with queued-gray-datasets

With more here

I would suggest getting everything started up again, then allowing the data to process! Avoid deleting and rerunning. If you delete and rerun quickly enough, your jobs may never get a chance to run.

Then, for RNA-seq Transcriptomics pipelines, be sure to see our wonderful suite of tutorials at the :graduation_cap: Galaxy Training Network

Or, for DNA Assembly, please see

If new to Galaxy, please consider running through some basics here too!

All of the demonstration analysis projects include a workflow. Workflows are how to run analysis in batches. This is how to maximise the use of the public computational infrastructure.

Later on, you can explore our HTP production quality workflows in the :shuffle_tracks_button: IWC Workflow Library.

If I were to suggest just one item based on what you are doing now, this is what I would recommend the most, assuming you have RNA-seq data.

Hope this helps! :slight_smile:

Hi

Thanks for your email. I do quit some jobs before completing when I realise I have entered the wrong parameters. I have also done many successful transcriptome analyses and find Galaxy really good. There are problems with the pipelines due to the biology of the organism I work with and so the analysis has to be done stepwise. This is the first time I have tried to use Interproscan.

With interproscan I do not quit early as the red ‘an error occurred with the dataset’ message occurs after around 60 seconds (see earlier in the stream). So the questions I have are: am I correct in thinking the input file should be .fasta? And, is there a maximum number of sequences in the .fasta file? Does the line length of the sequences in the .fasta file matter?

Many thanks

Mark

Hi @Mark_C

We weren’t able to see the details for the job in your history, or the inputs, so it is hard to guess what the error was. In general, if a tool fails really quickly, there is some content issue with the inputs, or possibly the server has an issue that the administrators need to resolve. We can help to troubleshoot this.

Also keep in mind that several of the annotation modules for this tool are only available on a private version of Galaxy (or command line version). The public servers are only able to host the data that is not under a license. Example → Interproscan log4j error - #2 by jennaj.

Then, to troubleshoot on your own, you can review the tutorials we have that include the tool. The example data formats are probably what will be most interesting for you.

What to do

  1. Compare to the examples for where the tool is usually used in protocols, plus common data preparation steps.
  2. The input should be in fasta format. A tool like NormalizeFasta can help! Or Fasta Statistics to check the bases.
  3. The maximum number of sequences is hard to state, since the content of those sequences matters more than the count.
  4. Try with a subsample to see what happens if you think the content is exceeding resources (this use case usually involves the tool running for longer, then eventually dying, instead of a quick failure).
  5. For large work, running Split Fasta can break the job into smaller jobs, then you can use Collapse Collection afterwards to merge the results. Be careful about the output format if you decide to use this: TSV is usually the safest choice.
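
If you would rather check the file locally before uploading, steps 2, 4 and 5 can be roughly sketched in Python. This is only an illustration, not what NormalizeFasta, Fasta Statistics or Split Fasta do internally. The function names are made up for this example, and the protein alphabet (20 standard amino acids plus X) is an assumption, so please check the tool's own documentation for the characters it really accepts.

```python
from collections import Counter

# Assumption: 20 standard amino acids plus X (unknown residue).
# The characters the tool really accepts may differ.
ASSUMED_PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unexpected_characters(records, alphabet=ASSUMED_PROTEIN_ALPHABET):
    """Count sequence characters that fall outside the assumed alphabet."""
    counts = Counter()
    for _, seq in records:
        counts.update(ch for ch in seq.upper() if ch not in alphabet)
    return counts


def split_into_chunks(records, chunk_size):
    """Group records into lists of at most chunk_size records each."""
    records = list(records)
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def to_fasta(records, width=60):
    """Write records with simplified headers and a fixed line width."""
    lines = []
    for header, seq in records:
        # Keep only the first whitespace-delimited token of the header.
        short_id = (header.split() or ["unnamed"])[0]
        lines.append(">" + short_id)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Tiny made-up input, just to show the calls.
example = ">orf_1 [1 - 30] some description\nMKT*LV\nAAG\n>orf_2\nMXXK\n"
records = list(parse_fasta(example))
print(unexpected_characters(records))   # characters outside the assumed alphabet
subsample = records[:1]                 # a small subsample for a quick test run
chunks = split_into_chunks(records, 1)  # smaller batches for separate jobs
print(to_fasta(subsample, width=5))
```

If the character count comes back non-empty, that is a good place to start looking for a content problem. Trying the subsample first will also tell you quickly whether the failure depends on the size of the file at all.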

For reviewing your history, if you want to toggle the sharing off then on again, and include the error you would like feedback for, including the inputs to that job, we can try again! Thanks! :slight_smile: