Hi guys, new user here. I ran blastp on my sequence, but the output contains duplicates and I can't remove them or work around them, so Clustal gives me an error message. I tried the “Unique” tool on both FASTA and tabular files, but nothing helped. The tool to merge multiple FASTA files and filter unique sequences also returns an error. What should I do? Thanks!
Hello @David_Prikryl
For these technical processing issues:
- The Filter FASTA tool had a technical problem that should be resolved now. Please give it a try.
- For concatenating multiple fasta files, put them all into a collection list folder, then change the datatype to uncompressed (pencil icon), and then run the Collapse Collection tool.
From there, could you explain what you are trying to do? Maybe there is a better way to get this done. Below is what I think you want to do. I may be wrong in parts, but we can use it as a starting point for you to correct.
- Starting with a protein sequence
- Run BLASTp against a protein database to find homology based hits
- (you are having trouble here)
- Run ClustalW on the best hits (unique sequences)
- Run FASTTREE to create a tree
If this is correct, you could generate the tabular output (step 2 above), filter it for significant hits (don’t skip this!), retrieve the protein sequences for those hits, run ClustalW, then run FASTTREE.
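To make the "filter for significant hits" step concrete, here is a minimal Python sketch of the same logic outside Galaxy. It assumes the standard 12-column BLAST tabular layout (subject ID in column 2, e-value in column 11); the function name `unique_significant_hits` and the `1e-5` cutoff are illustrative choices of mine, not values from your job, so adjust them to your data.

```python
import csv
import io


def unique_significant_hits(handle, max_evalue=1e-5):
    """Return subject IDs whose e-value is <= max_evalue, in first-seen order.

    Assumes standard 12-column BLAST tabular output:
    column 2 (index 1) is the subject ID, column 11 (index 10) is the e-value.
    The 1e-5 default is only an illustrative threshold.
    """
    seen = set()
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present in "tabular with comments").
        if not row or row[0].startswith("#"):
            continue
        subject_id = row[1]
        evalue = float(row[10])
        if evalue <= max_evalue and subject_id not in seen:
            seen.add(subject_id)
            hits.append(subject_id)
    return hits


# Made-up example rows: subjA appears twice (two alignments), subjC is not significant.
example = io.StringIO(
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "query1\tsubjA\t80.0\t50\t10\t0\t1\t50\t300\t350\t1e-20\t90\n"
    "query1\tsubjB\t75.0\t180\t45\t2\t5\t185\t1\t180\t2e-30\t120\n"
    "query1\tsubjC\t30.0\t40\t28\t1\t10\t50\t7\t46\t0.5\t25\n"
)
print(unique_significant_hits(example))  # prints ['subjA', 'subjB']
```

Note that the same subject can show up on several rows when BLAST reports more than one alignment for it, which is one common source of "duplicates" in the hit list.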
You want the entire protein sequence for the hits, yes? With the hit identifiers isolated into a unique list, you could retrieve the full protein for each with → NCBI BLAST+ blastdbcmd entry(s) Extract sequence(s) from BLAST database. The result would be a FASTA file of unique sequences to use with ClustalW, sourced from the same database that you searched against.
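If a FASTA file still ends up with repeated records, the cleanup can be sketched in a few lines of Python. This is only an illustration, and it assumes that "duplicate" means two records sharing the same identifier (the first whitespace-delimited token after `>`); the helper name `dedupe_fasta` is hypothetical.

```python
def dedupe_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each identifier.

    Assumption: a record's identifier is the first whitespace-delimited
    token after '>' on its header line.
    """
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep and line:
            yield line


# Made-up example: record "a" appears twice, only the first copy is kept.
records = [">a first copy", "MKTAYIAK", ">b", "GGHLLV", ">a second copy", "MKTAYIAK"]
print(list(dedupe_fasta(records)))
# prints ['>a first copy', 'MKTAYIAK', '>b', 'GGHLLV']
```

Records with different identifiers but identical sequences would not be caught by this sketch; that case needs a comparison on the sequence itself.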
Please review and let me know which of my guesses are correct and which are not, and I’ll try to help more.