MOTHUR filter.seqs help

ameliak · May 9, 2025, 2:10am

Hi everyone,
I’ve been following the mothur tutorial to perform my metagenomics analysis, and I’m having trouble with the filter.seqs step in the data cleaning after sequence alignment.
When I use the same settings as the tutorial, all of the positions in my alignment are removed, so my fasta file is a list of empty sequences.

Screenshot 2025-05-09 at 12.08.17 pm

Removing the “trump” setting (.) retains my sequences, but I am confused about why my filtered alignment is still quite long.
Screenshot 2025-05-09 at 12.09.09 pm

I think it might be an issue with the alignment, as I wasn’t sure about the database I used, but the outputs for all the steps up to this point look as I expect, so I’m not sure.

Does anyone have any guidance on the outputs I’ve been getting, and on whether it would be okay to proceed with the long filtered alignment? Thanks in advance for your help!

jennaj · May 9, 2025, 9:46pm

Hi @ameliak

I have a completed history here with the tutorial data run through the tutorial’s workflow. Maybe you can compare the usage and notice where things are different?

https://usegalaxy.org/u/jen-galaxyproject/h/training-16s-rrna-sequencing-with-mothur-main-tutorial-5

This is the filter step to help you navigate the history.

https://usegalaxy.org/datasets/f9cad7b01a4721354bda2cc5b96cad18/details

These are the statistics from the log.

Length of filtered alignment: 376
Number of columns removed: 13049
Length of the original alignment: 13425
Number of sequences used to construct filter: 16298

As a pure guess, this seems to be based on the number of 16S reads that are used with this pipeline, not the smaller list of assemblies from the earlier steps (mothur align.seqs output).

Hope this helps!

ameliak · May 11, 2025, 11:19pm

Thanks for your help!

I’m still a bit confused about why all the columns are removed when I use the trump character, as this seemed like a standard selection in the tutorial.

And then I don’t really understand why the length of my filtered alignment is much longer than the amplicon size, but if it could just be because I have many more samples than in the tutorial I guess that should be okay.

Thanks for the history, it seems to match well with what I have done so I will try proceeding to the next steps!

jennaj · July 23, 2025, 6:49pm

Hi @ameliak

So sorry! I didn’t see the follow-up questions.

For this

The . characters (“trump”) mark positions in the alignment where a sequence has no data, typically the padding before its start or after its end. With trump set to ., the filter step removes every column in which at least one sequence has that character. If everything is being removed, that means there is no column that all of your sequences cover. This is very strange.
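To make the column logic concrete, here is a minimal Python sketch of how I understand the filter to behave. It is a simplified model, not mothur's code. It assumes that trump=. drops any column where at least one sequence has a ., and that vertical=T drops columns that are gaps (- or .) in every sequence. The function names and the toy sequences are made up for illustration.

```python
def build_filter(aligned, trump=None, vertical=True):
    """Return one boolean per alignment column: True means the column is kept.

    Simplified model (an assumption, not mothur's implementation):
      - trump: drop a column if ANY sequence has this character there
      - vertical: drop a column if EVERY sequence has '-' or '.' there
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        column = [seq[col] for seq in aligned]
        drop = False
        if trump is not None and trump in column:
            drop = True
        if vertical and all(c in "-." for c in column):
            drop = True
        keep.append(not drop)
    return keep


def apply_filter(seq, keep):
    """Keep only the characters whose column survived the filter."""
    return "".join(c for c, k in zip(seq, keep) if k)


# Toy reads that all cover the same region of the alignment.
overlapping = ["..ACG-T..", ".TACG-TA.", "..ACGGT.."]
# Toy reads that landed in different regions of the alignment.
disjoint = ["ACGT.....", ".....ACGT"]

for name, reads in (("overlapping", overlapping), ("disjoint", disjoint)):
    with_trump = build_filter(reads, trump=".")
    without_trump = build_filter(reads, trump=None)
    print(name, "kept with trump:", sum(with_trump),
          "kept without trump:", sum(without_trump))
```

In this toy model the disjoint reads lose every column with trump=. but keep most of them without it, which is the same pattern as an empty result with trump and a long filtered alignment without it.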

This next part is another clue

If the alignment is longer than the amplicon size, and all sequences include gaps, that means something went wrong with the upstream steps. I wouldn’t expect your exact results with only misapplied data cleaning.
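One way to look for that kind of upstream problem is to check where each aligned read starts and ends. mothur's summary.seqs reports start and end positions; the Python sketch below does a similar check on aligned sequences held in a list, so you can see the idea. The helper names are hypothetical, and it assumes - and . are the only gap characters.

```python
def aligned_span(seq):
    """Return (start, end) as 1-based columns of the first and last base,
    or None if the sequence contains only gap characters."""
    positions = [i for i, c in enumerate(seq, start=1) if c not in "-."]
    if not positions:
        return None
    return positions[0], positions[-1]


def shared_region(aligned):
    """Return the (start, end) column range covered by every sequence,
    or None if no single range is covered by all of them."""
    spans = [aligned_span(seq) for seq in aligned]
    spans = [span for span in spans if span is not None]
    if not spans:
        return None
    start = max(s for s, _ in spans)  # latest start among the reads
    end = min(e for _, e in spans)    # earliest end among the reads
    return (start, end) if start <= end else None


# Toy examples: reads covering the same region vs. different regions.
same_region = ["..ACG-T..", ".TACG-TA.", "..ACGGT.."]
different_regions = ["ACGT.....", ".....ACGT"]

print(shared_region(same_region))
print(shared_region(different_regions))
```

If the shared region comes back empty, or far narrower than the amplicon, the reads are not all aligning to the same part of the reference, and that is worth fixing before filtering.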

Even mixed up samples wouldn’t explain your results but maybe I am missing something. If you are still having problems and want to share your history back, that might help too.