Regular expression to filter for header using filter FASTA

After using Shovill to assemble paired-end NGS reads for DNAseq data, I am hoping to filter my FASTA output file based on the coverage of each assembled contig. The Shovill output is a FASTA file in which each header contains a unique contig ID and a coverage value. Using “Filter FASTA” on Galaxy, I want to output a FASTA file with only the contigs that have a coverage greater than 1000. Can I use “Filter FASTA” for this, and if so, what is an example of a Python regular expression that would filter on coverage?

Thank you very much in advance.

Hi @Gabriella_Quinn

Yes, if the coverage value is somewhere in a line of text data, it can be parsed out and/or filtered on.

The tool you mention is a good choice, but it isn’t installed at Galaxy Main https://usegalaxy.org. It is installed at Galaxy EU https://usegalaxy.eu. The post is tagged as usegalaxy.org (??) – may be a simple mistake? Please clarify the server by URL and that this is the tool you intend to use:

  • Filter FASTA on the headers and/or the sequences (Galaxy Version 2.1)

If you share an example of a few of your fasta title lines (the full “>” line for at least three fasta records) we can help you to construct a regular expression. If your contig fasta dataset is large (too large to scroll through to copy/paste out three title lines) you can filter out just the title lines with the tool/expression:

  • Select lines that match an expression (Galaxy Version 1.0.1)
  • using the regular expression ^>.
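For anyone who wants to preview the same selection outside Galaxy, here is a minimal Python sketch of what "keep only the lines that match ^>" does. The function name and the example records are made up for illustration; this is not the Select tool's actual code.

```python
import re

# Matches any line that begins with ">", i.e. a FASTA title line.
TITLE_LINE = re.compile(r"^>")

def header_lines(fasta_text):
    """Return only the FASTA title lines from a block of FASTA text."""
    return [line for line in fasta_text.splitlines() if TITLE_LINE.search(line)]

# Made-up example records, shortened for readability.
example = (
    ">contig00001 len=187 cov=233.9\n"
    "GAAGTTCCTC\n"
    ">contig00002 len=187 cov=24.1\n"
    "GGAGTTCCTC\n"
)

for line in header_lines(example):
    print(line)
```

Copying three or so of the printed lines is enough to design a filter expression.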

The Select tool could also be used for the full fasta filtering, but that would require converting the data to tabular format first, then back to fasta again after. If you want to do it that way, just state so. The regular expression to use will be a bit different between the two tools.

Should you decide to post back a few of your fasta title lines, be sure to preserve the formatting by using the “block quote” format option (“quote” icon at the top of where you write your reply back to this post). That should be enough information, but if not, I’ll ask you to share a history link with me (privately) to review your exact data. Different kinds of whitespace (spaces, tabs, etc) can look the same after copy/paste, and the difference sometimes matters.

Let’s start there. Thanks!

Thank you for your response. Yes! This is meant for usegalaxy.eu. My FASTA files are relatively small and the first three lines of one of my samples are posted below. These contigs are pretty short because they are sequences from a viral genome of about 1200bp.

I am hoping to filter for contigs that have a coverage greater than 500.

I appreciate your help with this.

> contig00001 len=187 cov=233.9 corr=0 origname=NODE_1_length_187_cov_233.928571 sw=shovill-spades/1.0.4 date=20190911
GAAGTTCCTCTTCCTCCTCCTTGTTCAGGCGCTTCCCTCCCGCGCTCAGCTGCTTTCTCTGTTCTCGAGGGCCTTCCTTCGTCGGTGATCCTGCCTCTCCTTGTCGGTGAACCCTCCTTGAGGGGCCTCTTCCTAGGTCCGGAGTCTACTTCCATCTGGTCCGTCCGGGCCTTCTTCGGGGGG
> contig00002 len=187 cov=24.1 corr=0 origname=NODE_9_length_187_cov_24.107143 sw=shovill-spades/1.0.4 date=20190911
GGAGTTCCTCTTCCTCCTCCTTGCTCAGGTTCTTCCCTCCCGCGGTCAGCTGCTTTCTCTGTTCTCGAGGGCCTTCCTTCGTCGGTGACCCTGCCTCTCCTTGTCGGTGAACCCTCCTGAGAGGCCTCTTCCTAGGTCCGGTGTCTACTTCCATCTGGTCCGTCCGGGCCCTCTTCGCGGGG
> contig00003 len=187 cov=62.5 corr=0 origname=NODE_6_length_187_cov_62.511905 sw=shovill-spades/1.0.4 date=20190911
TGAAGTTCCTCTTCCTCCTCCTTGCTCAGGCGCTTCCCTCCCGCGCTCAGCTGCTTTCTTGTTCTCGAGGGCCTTCCTTCGTCGGTGATCCTGCCTCCCCTTGTCGGTGAACCCTCCTGAGAGGCCTCTTCCTAGGTCCGGAGTCTACTTCCATCTGGTCCGTCCGGGCCTTCTTCGCGGG

note: admin reformatted

Hi @Gabriella_Quinn

Just to make sure I have the formatting correct, is there really a space in between the > and contigNNNN content in your dataset?

I did change the way the formatting was displayed here (I forgot that “block quote” doesn’t handle lines that start with a > very well). Adding “four spaces” before each line worked better to quote it accurately, and I think it preserved your original intended format, but I need to confirm that. Please review what is displayed now.

A space in that location will cause problems with any tool expecting fasta format. Please clarify, then we can get that removed with a different tool, and proceed with the Filter Fasta tool.

Here is the FAQ for fasta formatting. Don’t worry about the description line content; you want that preserved for your current use case so it can be filtered on. But the sequences do need to have a correct identifier. Each part of the “>” fasta title line is explained in the FAQ if this seems confusing. https://galaxyproject.org/learn/datatypes/#fasta
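If a dataset did turn out to have a space after the ">", the fix amounts to a single substitution on each title line. A minimal Python sketch of the idea follows; the function name is hypothetical, and this is not a specific Galaxy tool.

```python
import re

def fix_title_line(line):
    """Remove spaces or tabs between '>' and the identifier.

    Lines that do not start with '>' (sequence lines) are returned unchanged.
    """
    return re.sub(r"^>[ \t]+", ">", line)

print(fix_title_line("> contig00001 len=187 cov=233.9"))
print(fix_title_line("GAAGTTCCTC"))
```

Only the whitespace directly after ">" is removed, so the description content (len=, cov=, and so on) stays available for filtering.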

Thanks!

Thank you for clarifying. There is not a space. Below is an example from one line of the FASTA file. Other than that your corrected format matches my file.

>contig00004 len=193 cov=13.3 corr=0 origname=NODE_863_length_193_cov_13.322222 sw=shovill-spades/1.0.4 date=20190915
GAAGGAAAGACCGCGGGGGGAGGGAAGAGATC

note: admin reformatted

@Gabriella_Quinn

Ok, good!

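Try this expression to find contigs with coverage at or above 500.0. The first alternative matches 500 through 999 and the second matches 1000 and up; the parentheses keep both alternatives attached to cov=.

^.+cov=([5-9][0-9]{2}|[1-9][0-9]{3,})\..+$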
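A coverage expression like this can be sanity-checked against a few title lines before running it on a full dataset. The sketch below is only an illustration, under the assumption that the headers look like the Shovill examples in this thread and that the pattern is evaluated with Python's re module. It uses a form of the pattern in which both alternatives are grouped after cov=, so three-digit values from 500 and values with four or more digits are both covered. It also shows a numeric alternative that parses the value and compares it as a number. The demo_* headers and the helper name are made up.

```python
import re

# Illustrative pattern: "cov=" followed by 500-999 or by any value with
# four or more digits, then a decimal point and the rest of the line.
COV_AT_LEAST_500 = re.compile(r"^.+cov=([5-9][0-9]{2}|[1-9][0-9]{3,})\..+$")

# Numeric alternative: capture the value after "cov=" and compare it as a float.
COV_VALUE = re.compile(r"\bcov=([0-9]+(?:\.[0-9]+)?)")

def coverage(header):
    """Return the cov= value of a title line as a float, or None if absent."""
    match = COV_VALUE.search(header)
    return float(match.group(1)) if match else None

headers = [
    # Real example from this thread.
    ">contig00001 len=187 cov=233.9 corr=0 origname=NODE_1_length_187_cov_233.928571 sw=shovill-spades/1.0.4 date=20190911",
    # Made-up headers that probe the boundaries.
    ">demo_a len=187 cov=500.0 corr=0",
    ">demo_b len=187 cov=1000.5 corr=0",
    ">demo_c len=187 cov=49.9 corr=0",
]

for h in headers:
    by_regex = bool(COV_AT_LEAST_500.search(h))
    by_number = coverage(h) >= 500
    print(h.split()[0], coverage(h), by_regex, by_number)
```

The two checks should agree on every header. The numeric version is easier to adjust when the threshold changes, while the pure regular expression is what a header-filtering tool needs.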

This worked perfectly.

Thank you very much for your help!
