Selective elimination of Sequences using tools

Hi guys,

I’ll keep it as simple as possible. I’ve got a FASTA file containing a library of DNA sequences 40 bases in length. This file contains sequences I want to eliminate.

I have another file containing all the sequences I need to eliminate, but I can’t find a tool that filters based on the sequences themselves rather than IDs. I can’t use IDs as they won’t match up.

I think the TN93 tool may be the best option, but I’m unsure how to use it properly. I have a clustered FASTA file as it requires, but it doesn’t seem to work.

Can anyone offer some tips?

(More detail if you wish to read)

I’m a third-year PhD student who has recently finished some SELEX protocols and got some sequencing data back.

I have an initial random library and two post SELEX libraries. In theory every sequence in the original library should be unique, but they’re not: some have as many as 20,000 copies and so have been carried through to the post SELEX libraries.

I wish to eliminate them from the post SELEX libraries.



Paul: If I understand correctly: one file contains many sequences, and the other contains the sequences you need to eliminate by matching them to the first file?

Yeah, that’s it. However, the IDs of the sequences don’t match between the two files, only the sequences themselves.

I tried to convert to tabular and filter using the second (sequence) column, but I think I may have got the syntax wrong as it uses regex syntax.

Hi @Paul_E_Coombes

Did you solve this yet?

If the sequence is expected to be exact, then a tool like Compare two Datasets would work. Input the larger file with all sequences first, the smaller file with sequences to remove second, and set the “columns for comparison” for both to the sequence (in tabular format) and the “To find” option to “Non Matching rows of first dataset”.
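If you would rather check the exact-match idea outside Galaxy, here is a minimal Python sketch of the same logic. It is not what Compare two Datasets does internally, just an illustration of keeping the non-matching rows by sequence. The function names (`read_fasta`, `filter_by_sequence`) and the toy records are made up for the example, and it assumes that matching should ignore upper/lower case.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_sequence(library, unwanted):
    """Keep records from `library` whose sequence is absent from `unwanted`."""
    # Assumption: comparison is case-insensitive; IDs are ignored entirely.
    blocklist = {seq.upper() for _, seq in unwanted}
    return [(h, s) for h, s in library if s.upper() not in blocklist]


# Toy stand-ins for the two files (hypothetical data, not real reads).
post_selex = StringIO(">read1\nACGTACGT\n>read2\nTTTTCCCC\n>read3\nacgtacgt\n")
to_remove = StringIO(">libA\nACGTACGT\n")

kept = filter_by_sequence(read_fasta(post_selex), read_fasta(to_remove))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

With real data you would pass `open("post_selex.fasta")` style handles instead of `StringIO` (those file names are placeholders). Because the unwanted sequences go into a set, a lookup per read is cheap even for large libraries.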

If running that with all the data at once is too large, Split up the larger file into a collection, run in batch, then Collapse the collection into a single result.

Or maybe the sequence reads are not exact? Instead of trying to come up with a regex, maybe map (with BLASTN?), parse out the sequence identifiers in the tabular output, then use those to filter/compare. Both query and target IDs will be in that result, and the order of query/target inputs probably doesn’t matter for this tool. You could try both.
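For the inexact case, the "parse out the identifiers, then filter" step could look roughly like the sketch below. It assumes the standard 12-column BLAST tabular layout (query ID, subject ID, percent identity, alignment length, ...) with the post SELEX reads used as the query. The thresholds (`min_identity`, `min_length`), the function names, and the two hit lines are all invented for illustration, so tune them to your own data.

```python
def ids_with_near_exact_hits(blast_tabular, min_identity=95.0, min_length=38):
    """Collect query IDs from BLAST tabular lines that pass both thresholds."""
    hits = set()
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Columns: 0 = query ID, 2 = percent identity, 3 = alignment length
        query_id, pident, length = fields[0], float(fields[2]), int(fields[3])
        if pident >= min_identity and length >= min_length:
            hits.add(query_id)
    return hits


def drop_by_id(records, ids_to_drop):
    """Remove (header, sequence) records whose ID was flagged."""
    # The ID is taken as the first whitespace-delimited token of the header.
    return [(h, s) for h, s in records if h.split()[0] not in ids_to_drop]


# Made-up hits: read1 aligns over the full 40 bases, read2 only partially.
blast_output = (
    "read1\tlibA\t97.50\t40\t1\t0\t1\t40\t1\t40\t1e-15\t69.4\n"
    "read2\tlibB\t100.00\t18\t0\t0\t3\t20\t5\t22\t0.002\t34.2\n"
)
records = [("read1", "A" * 40), ("read2", "C" * 40), ("read3", "G" * 40)]

flagged = ids_with_near_exact_hits(blast_output)
remaining = drop_by_id(records, flagged)
print(sorted(flagged), [h for h, _ in remaining])
```

Requiring a minimum alignment length as well as identity is a design choice here: a short perfect match inside a 40-base read is probably not the same library sequence, so it is not flagged.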

Refs: Data Manipulation Olympics and Using dataset collections