Removal of spaces from fasta file

fasta-manipulation
fasta
custom-genome

#1

I have a large fasta file, which looks like this after I removed a few sequences.
JE116607_g1_i1 len=581 path=[559:0-580]

ATAAATATGATGAAAAGAGCTGTCTACATTAGGAAAACAAGGGAAATAGGATCCCTCTTC

TAAGCCATAATTGGTAAACAGTCATAAGAAATGAATCATAATCTTGTGGAGATGCATTCC

AGTGTCTTCTCACTAGACCCAGATTTGCAAGAAAGCTTGCATACATTTCCTCAG

AAATAGAAAGCAAGATTGCCTTGAGGTTTTGAACATCATTGTGGCTTATAACAATTGAGA

ATTTACTCCAATCCAAAACATTGGCAAAGGGTAGATCATAGTAATCAGAAATGATAACGG

GAACACAACCAAAATGAATAGCATCTGAGATCCTAGCTGTATTCACTTCATAGCCTTTAA

CATGTAGACAATACTTGCTTTTTGTCAAGCCTTCACGTGAATCATGAAATTTTCCAGAGA

TAAGCATTGAAGTATCATTCTCCCACATTAGTCGCTCACGAATTCTGGAATTCT

TCTCTCGCCCAGCAAAAAACGTGGTCCTTTCACTTGGAGGAACTAAACTTGTTT

CTGGCAGCCGAGGCCATACTTGCGGCAAGGCTACATCTTTA

I want to remove the spaces at the end of the lines. I am a Windows user.


#2

Run the fasta dataset through the tool NormalizeFasta.

Tips:

  • Fasta datasets have fasta identifier lines that start with a > character, otherwise tools will not recognize the format/datatype. Do this before loading the data into Galaxy and using the tool NormalizeFasta.

    Meaning that lines in this format:

    JE116607_g1_i1 len=581 path=[559:0-580]

    Should be modified to be in this format:

    >JE116607_g1_i1 len=581 path=[559:0-580]

  • Some tools can handle unwrapped fasta data lines and some cannot. Wrapping at 80 bases is a common choice and the most likely to work with most tools. Any consistent length between 40 and 80 will work with many. Some tools do not care about wrapping, but these are mostly mapping tools, so you are likely to have trouble with unwrapped fasta data in other downstream tools. It is a good idea to fix the wrapping at the beginning of an analysis, especially if you plan to use the fasta as a Custom Genome or as another input to a tool. NormalizeFasta can do the wrapping operation.

  • Also, trimming the fasta identifier lines at the first whitespace is often important, and required if using the data as a Custom Genome. All the identifiers in any single fasta dataset must be unique, so check that they would still be unique after trimming and make adjustments, if needed, before loading the data into Galaxy. NormalizeFasta can do the trimming operation.

  • Not formatting fasta data correctly can lead to tool errors, whether working in Galaxy or otherwise.
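
To make the wrapping and trimming tips concrete, here is a minimal Python sketch of the same kind of clean-up. It is not how NormalizeFasta is implemented, only an illustration of the operations described above; the function name `normalize_fasta` and the default width of 80 are my own choices. It assumes the identifier lines already start with a > character.

```python
def normalize_fasta(lines, width=80):
    """Illustrative sketch only (not the NormalizeFasta tool itself).

    Drops blank lines and stray whitespace, trims each identifier at the
    first whitespace, checks that identifiers stay unique, and rewraps the
    sequence at a fixed width. Returns the cleaned lines as a list.
    """
    records = []  # each entry: [identifier, list of sequence pieces]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between sequence lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("identifier line with no identifier")
            records.append([fields[0], []])  # keep text before first whitespace
        elif not records:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # remove any whitespace inside or at the end of the sequence line
            records[-1][1].append("".join(line.split()))

    seen = set()
    cleaned = []
    for identifier, pieces in records:
        if identifier in seen:
            raise ValueError("duplicate identifier after trimming: " + identifier)
        seen.add(identifier)
        sequence = "".join(pieces)
        cleaned.append(">" + identifier)
        # rewrap the joined sequence at a consistent line length
        cleaned.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return cleaned
```

The duplicate check matters because two identifiers that differ only after the first whitespace become identical once trimmed.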

FAQs: https://galaxyproject.org/support/


#3

Dear Jennifer

Thx a lot for your reply. I am a Windows user and the NormalizeFasta tool does not run on Windows. Kindly suggest a tool that can be run on Windows. Thx


#4

Hi - This forum is for help with the Galaxy platform. It runs through a browser window at publicly hosted sites.

These and more choices are listed here: https://galaxyproject.org/use/

For help with command-line bioinformatics, try forums such as https://www.biostars.org/ and https://bioinformatics.stackexchange.com/.

Hope that helps to clarify.


#5

I guess the confusion here may come from this sentence you wrote earlier, @jennaj:

@humaira what this was supposed to say (I guess you read it differently) was that you should add the > symbol in front of each sequence title that doesn’t have it yet before uploading your data to the server. Then, as a next step, use the NormalizeFasta tool on the server to fix the remaining issues with your input.
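
If editing the titles by hand is not practical for a large file, the first step could be sketched in Python, which also runs on Windows. This is only a rough illustration under a loud assumption: any non-blank line that contains characters other than nucleotide codes is treated as a sequence title. That guess fits the nucleotide example in the first post but would be wrong for protein data or for sequence lines with spaces in the middle. The function name `add_missing_markers` is made up for this example.

```python
# Characters accepted as sequence data (assumption: nucleotide fasta only).
NUCLEOTIDES = set("ACGTUNRYSWKMBDHVacgtunryswkmbdhv-*")


def add_missing_markers(lines):
    """Add '>' to title lines that lack it; drop blank lines and trailing spaces.

    Heuristic sketch: a line is taken to be a title if it holds any character
    that is not a nucleotide code. Check the result before uploading.
    """
    fixed = []
    for raw in lines:
        line = raw.strip()  # removes trailing spaces and line endings
        if not line:
            continue  # drop the empty lines between sequence lines
        if line.startswith(">") or set(line) <= NUCLEOTIDES:
            fixed.append(line)  # already a title, or plain sequence data
        else:
            fixed.append(">" + line)  # looks like a title without its marker
    return fixed
```

To use it, read the file into a list of lines, pass the list to the function, and write the returned lines to a new file joined with newline characters. Then upload that new file and run NormalizeFasta on it for the remaining fixes.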