Interproscan format issues

Paul_Helfrich · October 31, 2024, 11:00pm

Hi folks,
I am having a hard time mapping a large number of sequences with InterProScan, and it appears that the tool is not recognizing my sequences as valid. I am able to correctly map some sequences that look like this:

d1f1aa_ b.1.8.1 (A:) Cu,Zn superoxide dismutase, SOD {Baker’s yeast (Saccharomyces cerevisiae) [TaxId: 4932]}
vqavavlkgdagvsgvvkfeqasesepttvsyeiagnspnaergfhiqefgdatngcvsa
gphfnpfkkthgaptdevrhvgdmgnvktdengvakgsfkdslikligptsvvgrsvvih
agqddlgkgdteeslktgnagprpacgvigltn

They are in fasta format like this:

InterProScan works with them:

but when I attempt to input other sequences (from another source) that look like this:

COX_I_YP_001518912.1_Cu_user;Acaryochloris_marina_MBIC11017(Bacteria/Cyanobacteria)
MTEAQAPHLEEVEVTPWREYFSFSTDHKVIGIQYLVTSFVFYLIGGLLAELVRTELATPASDFVPRETYNELFTMHATIMIFLWIIPTLTGGFGNFLVPLMIGARDMAFPKLNAIAFWIIPPTSILLLCSFFVGPASAGWTSYPPLSLMTNKAGEAIWILGVILLGTSSIMAGLNFLVTILKMRIPSMTLNDMPLFCWAMLATSALQLVATPVLSGAMVLLGFDLLVGTNFFNPAGGGDPIVYQHMFWFYSHPAVYIMILPAFGLISEILPVHARKPIFGYQAIAYSSIAISFLGLIVWAHHMFTSGTPDWLRMFFMIATMVIAVPTGIKVFSWVATVWGGKLNLCSAMLFGMAFVSMFVVGGLSGVMVASVPFDIHVHDTYFVVAHLHYVLFGGSVFGIYAGLYHWFPKMTGRMLNEFWGKVHFAMTFVGFNICFLPMHVLGLQGMNRRIAEYDPKFAALNVVCTIGSYILATSTIPFVVNAVWSWLAGPRANSNPWKGLTLEWTVPSPPPVENFEEDPVLAIGPYDYGTPKALDFVAATLAPAHALAAESLE

in the fasta like this:

The analysis fails. The bug report looks like this:

I tried simplifying the header format, but I am unclear why the sequences are still not being picked up as valid. Both files are correctly imported as fasta.

jennaj · November 4, 2024, 6:32pm

Hi @Paul_Helfrich

The tool thinks that your sequences are not protein amino acid sequences.

I would start with these two:

1. Double-check that all characters are in fact aa residues.
   - I’m not sure if stop codons are OK, but you can check the tutorial examples to see if those explain it, or just remove them if present and test.
2. Simplify the fasta format. Using NormalizeFasta and stripping off the description part of the > fasta title lines, leaving only the identifier, is a common and quick way to do this.
   - Description content can get in the way, especially if it contains odd characters, and isn’t used by this tool anyway, so it can be safely removed.
   - You could parse that out, then join it back in after processing if you want it.
   - The default wrapping length is probably OK, but I always use 80 since that was the original, and therefore the most commonly accepted, wrapping length (with maybe 40 for protein).
   - You could also test unwrapped to see what happens.
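
For anyone who wants to try the two checks above outside Galaxy, here is a minimal Python sketch. It is not how NormalizeFasta is implemented; it is only a rough local approximation. The function names (`parse_fasta`, `normalize`) and the `ALLOWED` alphabet are my own assumptions for illustration. The alphabet is the 20 standard residues plus common ambiguity codes, and I don't know exactly which characters InterProScan itself accepts. The sketch keeps only the identifier (the title up to the first whitespace), rewraps the sequence at 80, and reports any characters outside the assumed alphabet, such as `*`.

```python
# Assumption: 20 standard residues plus common ambiguity codes (B, J, O, U, X, Z).
# '*' (stop) and '-' (gap) are deliberately left out so that they get reported.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def parse_fasta(text):
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def normalize(text, width=80):
    """Return (cleaned FASTA text, {identifier: unexpected characters})."""
    out, problems = [], {}
    for title, seq in parse_fasta(text):
        # Keep only the identifier: everything before the first whitespace.
        parts = title.split(None, 1)
        ident = parts[0] if parts else ""
        # Report characters outside the assumed alphabet (case-insensitive).
        bad = sorted(set(seq.upper()) - ALLOWED)
        if bad:
            problems[ident] = bad
        out.append(">" + ident)
        # Rewrap the sequence to a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n", problems
```

Calling `normalize(open("input.fasta").read())` (the file name is a placeholder) gives back the cleaned text plus a dictionary of any sequences with unexpected characters. Note that it only cuts the title at whitespace, so punctuation inside the identifier itself is left untouched.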

I haven’t used this tool in a while, so these are mostly guesses. You are welcome to share back a small history with just this error for more help.

Paul_Helfrich · November 7, 2024, 4:43am

Thank you very much, I was able to solve it with NormalizeFasta as you suggested!

jennaj · November 7, 2024, 6:16pm

Great, glad that worked!
