Interproscan format issues

Hi folks,
I am having a hard time mapping a large number of sequences with Interproscan; the tool does not appear to recognize my sequences as valid. I am able to correctly map some sequences that look like this:

d1f1aa_ b.1.8.1 (A:) Cu,Zn superoxide dismutase, SOD {Baker’s yeast (Saccharomyces cerevisiae) [TaxId: 4932]}
vqavavlkgdagvsgvvkfeqasesepttvsyeiagnspnaergfhiqefgdatngcvsa
gphfnpfkkthgaptdevrhvgdmgnvktdengvakgsfkdslikligptsvvgrsvvih
agqddlgkgdteeslktgnagprpacgvigltn

They are in fasta format like this:

Interproscan works with them:

but when I attempt to input other sequences (from another source) that look like this:

COX_I_YP_001518912.1_Cu_user;Acaryochloris_marina_MBIC11017(Bacteria/Cyanobacteria)
MTEAQAPHLEEVEVTPWREYFSFSTDHKVIGIQYLVTSFVFYLIGGLLAELVRTELATPASDFVPRETYNELFTMHATIMIFLWIIPTLTGGFGNFLVPLMIGARDMAFPKLNAIAFWIIPPTSILLLCSFFVGPASAGWTSYPPLSLMTNKAGEAIWILGVILLGTSSIMAGLNFLVTILKMRIPSMTLNDMPLFCWAMLATSALQLVATPVLSGAMVLLGFDLLVGTNFFNPAGGGDPIVYQHMFWFYSHPAVYIMILPAFGLISEILPVHARKPIFGYQAIAYSSIAISFLGLIVWAHHMFTSGTPDWLRMFFMIATMVIAVPTGIKVFSWVATVWGGKLNLCSAMLFGMAFVSMFVVGGLSGVMVASVPFDIHVHDTYFVVAHLHYVLFGGSVFGIYAGLYHWFPKMTGRMLNEFWGKVHFAMTFVGFNICFLPMHVLGLQGMNRRIAEYDPKFAALNVVCTIGSYILATSTIPFVVNAVWSWLAGPRANSNPWKGLTLEWTVPSPPPVENFEEDPVLAIGPYDYGTPKALDFVAATLAPAHALAAESLE

in the fasta like this:

The analysis fails. The bug report looks like this:

I tried simplifying the header format, but it is unclear to me why this is still not being picked up as a valid sequence. Both files are correctly imported as fasta datasets.

Hi @Paul_Helfrich

The tool thinks that your sequences are not protein amino acid sequences.

I would start with these two:

  1. Double check that all characters are in fact aa residues.

    • I’m not sure if stop codons are OK, but you can check the tutorial examples to see if those explain it, or just remove any that are present and test.
  2. Simplify the fasta format. Using NormalizeFasta and stripping off the description part of the > fasta title lines, leaving only the identifier, is a common and quick way to do this.

    • Description content can get in the way, especially if it contains odd characters, and isn’t used by this tool anyway so can be safely removed.
    • You could parse that out, then join it back in after processing if you want it.
    • The default wrapping length is probably OK but I always use 80 since that was the original, and therefore the most commonly accepted wrapping length (with maybe 40 for protein).
    • You could also test unwrapped to see what happens.
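If you want to run both checks outside Galaxy, here is a rough Python sketch of the idea. It is not how NormalizeFasta is implemented. The names `read_fasta`, `clean_record`, and `VALID` are made up for illustration, and the accepted residue alphabet is an assumption on my part (the 20 standard amino acids plus common ambiguity codes), so adjust it to whatever the tool actually accepts. It keeps only the identifier (the text before the first whitespace), drops trailing `*` stop characters, reports any other unexpected characters, and re-wraps the sequence.

```python
# Assumed alphabet: 20 standard residues plus common ambiguity/rare codes.
# Which letters Interproscan itself accepts is NOT confirmed here.
VALID = set("ACDEFGHIKLMNPQRSTVWY" "BXZUO")


def read_fasta(text):
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def clean_record(title, seq, width=80):
    """Return (fasta_text, bad_chars) for one record.

    The title is cut down to its identifier, trailing '*' characters are
    removed, and the sequence is upper-cased and wrapped at `width`.
    bad_chars lists characters that are not in the assumed alphabet.
    """
    parts = title.split(None, 1)
    identifier = parts[0] if parts else ""
    seq = seq.upper().rstrip("*")
    bad = sorted(set(seq) - VALID)
    wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{identifier}\n{wrapped}\n", bad
```

To use it, read the file into a string, loop over `read_fasta(text)`, and write out the first value returned by `clean_record` for every record whose `bad` list is empty. Records with a non-empty `bad` list are the ones worth inspecting by hand.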

I haven’t used this tool in a while, so these are mostly guesses. You are welcome to share back a small history with just this error for more help. :scientist:

Thank you very much, I was able to solve it with NormalizeFasta as you suggested!


Great, glad that worked! :rocket: