I was wondering if someone knows how to download the FASTA sequences of proteins without their predicted signal sequence. I get the protein data from the UniProt database. At the moment I’m removing the signal sequence from my FASTA sequences by hand. But I reckon there has got to be a more efficient way, right? I really appreciate your help.
I am also a newbie in this forum. I guess this forum is more about getting help with running Galaxy tools, interfacing tools with Galaxy, and installing Galaxy. Maybe it would be OK to ask “Which tool can remove signal peptides from a FASTA file?”, but I am unsure.
I think you could post general questions to biostars.org. Maybe this post will be interesting for you: https://www.biostars.org/p/281716/ If you start automating a process, you will have to learn a little bit of programming, and Python is very useful for that.
Alternatively, you could ask the database provider directly, i.e. UniProt: https://www.uniprot.org/help/
Hope this helps.
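To give an idea of what the Python route could look like, here is a minimal sketch, not a tested pipeline. It assumes you already have, for each accession, the 1-based position of the last signal peptide residue (for example, taken from the UniProt “Signal peptide” feature annotation). The function names and the `signal_end` dictionary are made up for illustration.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header: str) -> str:
    """Return the accession from a UniProt-style header ('sp|ACC|NAME ...').

    Falls back to the first word of the header for other layouts.
    """
    first = (header.split() or [""])[0]
    parts = first.split("|")
    return parts[1] if len(parts) >= 3 else first


def strip_signal(fasta_text: str, signal_end: Dict[str, int]) -> str:
    """Cut the signal peptide off every record that has a known end position.

    signal_end maps accession -> 1-based position of the last signal
    peptide residue (hypothetical input, prepared by you beforehand).
    Records without an entry are written out unchanged.
    """
    out = []
    for header, seq in read_fasta(fasta_text):
        end = signal_end.get(accession_of(header), 0)
        out.append(f">{header}\n{seq[end:]}")
    return "\n".join(out) + "\n"
```

For a record whose signal peptide covers residues 1 to 9, `strip_signal(fasta_text, {"X00001": 9})` keeps everything from residue 10 onward. The accession `X00001` is a placeholder, not a real entry.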
There are several command-line and FASTA manipulation utilities in Galaxy as wrapped tools, and ways to access functions that are not specifically wrapped as tools.
Search the tool panel at the Galaxy server where you are working to find tools by keyword search (sed, awk, fasta).
Use a Jupyter notebook in Galaxy: “Visualize > Interactive environments”
The biostars post describes extracting regions from FASTA sequences based on coordinates in BED format. When working in Galaxy, that is probably the fastest method, using the tool Extract Genomic DNA using coordinates. The coordinates must be in strict BED format and the FASTA in “custom genome” format. Once you have a successful process/tool flow designed, save it as a Workflow and run the whole process in batch.
You’ll need to transform the coordinates to BED and make sure the FASTA is in the right format for a custom genome. The help below may mention DNA or genomes/chromosomes, but the functions work with just about any BED/FASTA data as long as it is formatted correctly and has the right metadata assigned (e.g. datatype).
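The coordinate transformation is mostly an off-by-one exercise. As a rough sketch: BED intervals are 0-based and half-open, so if the signal peptide covers residues 1 to n (1-based), the mature region starts at BED position n and ends at the sequence length. The function name below is hypothetical, and the sequence identifier is assumed to match the FASTA title line of the custom genome exactly.

```python
def mature_bed_line(seq_id: str, seq_length: int, signal_end: int) -> str:
    """Return a three-column BED line for the region after the signal peptide.

    seq_id     -- identifier, assumed identical to the FASTA title line
    seq_length -- total number of residues in the precursor sequence
    signal_end -- 1-based position of the last signal peptide residue

    BED is 0-based, half-open: start = signal_end, end = seq_length.
    """
    if not 0 <= signal_end < seq_length:
        raise ValueError("signal_end must be >= 0 and smaller than seq_length")
    return f"{seq_id}\t{signal_end}\t{seq_length}"
```

With a 15-residue sequence and a 9-residue signal peptide, this gives the interval 9 to 15, which corresponds to residues 10 through 15 in the usual 1-based numbering.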
UniProt’s “Mature” protein annotation comments might be filterable there, or you can use basic functions in Galaxy: convert FASTA-to-Tabular, Select lines with keywords/regular expressions, and convert back Tabular-to-FASTA. This is an alternative to the BED/FASTA method.
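For anyone who wants to see what that three-step round trip does to the data, here is a small Python imitation of it. This is only an illustration of the idea, not how the Galaxy tools are implemented, and the function names are invented. It assumes title lines contain no tab characters.

```python
import re


def fasta_to_tabular(fasta_text: str) -> str:
    """Turn each FASTA record into one 'title<TAB>sequence' line."""
    rows, header, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                rows.append(header + "\t" + "".join(chunks))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        rows.append(header + "\t" + "".join(chunks))
    return "\n".join(rows)


def select_lines(tabular_text: str, pattern: str) -> str:
    """Keep only the lines matching the regular expression.

    The whole line is searched, title and sequence alike, so choose a
    pattern that cannot match inside the sequence column by accident.
    """
    regex = re.compile(pattern)
    return "\n".join(row for row in tabular_text.splitlines() if regex.search(row))


def tabular_to_fasta(tabular_text: str) -> str:
    """Convert 'title<TAB>sequence' lines back into FASTA records."""
    records = []
    for row in tabular_text.splitlines():
        if not row:
            continue
        header, seq = row.split("\t", 1)
        records.append(f">{header}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")
```

Chaining the three calls, e.g. `tabular_to_fasta(select_lines(fasta_to_tabular(text), "mature"))`, keeps only the records whose line contains the keyword. The keyword `mature` is just an example here.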
Contacting UniProt directly for help, as @SamGG suggests, is the best way forward if you want to get these specific mature sequences directly using their tools, which is how your original question was interpreted. Galaxy tools tend to be single-function to facilitate easier workflow design. There are some UniProt data retrieval tools in Galaxy (search tools by keyword), but I don’t think these will filter/get what you want in a single step. You can still try them out using the Galaxy EU server.
How-to help for the above manipulations in Galaxy.