Selective elimination of Sequences using tools

Paul_E_Coombes · April 5, 2023, 2:45pm

Hi guys,

I’ll keep it as simple as possible. I’ve got a FASTA file containing a library of DNA sequences, each 40 bases long. This file contains sequences I want to eliminate.

I have another file containing all the sequences I need to eliminate, but I can’t find a tool that filters on the sequences themselves rather than the IDs. I can’t use IDs as they won’t match up.

I think the TN93 tool may be the best option, but I’m unsure how to use it properly. I have a clustered FASTA file, as it requires, but it doesn’t seem to work.

Can anyone offer some tips?

(More detail if you wish to read)

I’m a third year PhD student that has recently finished some SELEX protocols and got some sequencing data back.

I have an initial random library and two post-SELEX libraries. In theory every sequence in the original library should be unique, but they’re not: some have as many as 20,000 copies and so have been carried through to the post-SELEX libraries.

I wish to eliminate them from the post-SELEX libraries.

Cheers

Paul

nekrut · April 6, 2023, 5:57pm

Paul: if I understand correctly, one file contains many sequences and the other contains the sequences you need to eliminate by matching them against the first file?

Paul_E_Coombes · April 11, 2023, 9:00am

Yeah, that’s it. However, the IDs of the sequences don’t match between the two files, only the sequences themselves.

I tried to convert to tabular and filter using the second (sequence) column, but I think I may have got the syntax wrong, as it uses regex syntax.

jennaj · April 26, 2023, 7:37pm

Hi @Paul_E_Coombes

Did you solve this yet?

If the sequence is expected to be exact, then a tool like Compare two Datasets would work. Input the larger file with all sequences first, the smaller file with sequences to remove second, and set the “columns for comparison” for both to the sequence (in tabular format) and the “To find” option to “Non Matching rows of first dataset”.
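
For anyone who wants to sanity-check the result outside Galaxy, the same exact-match idea can be sketched in a few lines of Python. This is only an illustration, not what Compare two Datasets does internally: `read_fasta` and `remove_by_sequence` are hypothetical helper names, the example records are made up, and it assumes that upper/lower case should be ignored when comparing sequences.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def remove_by_sequence(library, unwanted):
    """Keep library records whose sequence is absent from `unwanted`."""
    # Assumption: case is ignored, so 'acgt' and 'ACGT' count as the same sequence.
    unwanted_seqs = {seq.upper() for _, seq in unwanted}
    return [(h, s) for h, s in library if s.upper() not in unwanted_seqs]

# Tiny in-memory example; real use would pass open file handles instead.
library = io.StringIO(">post_1\nACGTACGT\n>post_2\nTTTTCCCC\n>post_3\nGGGGAAAA\n")
unwanted = io.StringIO(">init_9\nTTTTCCCC\n")

kept = remove_by_sequence(read_fasta(library), read_fasta(unwanted))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

The IDs are never compared, only the sequences, so mismatched IDs between the two files do not matter here.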

If running that with all the data at once is too large, Split up the larger file into a collection, run in batch, then Collapse the collection into a single result.
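
The split / run in batch / collapse pattern works because each record is kept or dropped independently of the others, so filtering chunk by chunk and concatenating gives the same answer as filtering everything at once. A rough Python sketch of that idea, assuming records are already `(id, sequence)` pairs; `batches` and `filter_in_batches` are hypothetical names and the data is invented:

```python
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def filter_in_batches(records, unwanted_seqs, size):
    """Filter each batch separately, then concatenate the survivors."""
    kept = []
    for chunk in batches(records, size):
        kept.extend(r for r in chunk if r[1] not in unwanted_seqs)
    return kept

records = [("r1", "ACGT"), ("r2", "TTTT"), ("r3", "GGGG"),
           ("r4", "TTTT"), ("r5", "CCCC")]
unwanted_seqs = {"TTTT"}

batched_result = filter_in_batches(records, unwanted_seqs, size=2)
whole_result = [r for r in records if r[1] not in unwanted_seqs]
print(batched_result == whole_result)
```

Only the larger file is split; every batch still needs the complete list of sequences to remove.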

Or maybe the sequence matches are not exact? Instead of trying to come up with a regex, maybe map (with BLASTN?), parse out the sequence identifiers in the tabular output, then use those to filter/compare. Both query and target IDs will be in that result, and the order of query/target inputs probably doesn’t matter for this tool. You could try both.
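
To make "not exact" concrete: since the sequences in this thread are all the same length (40 bases), one simple way to think about near matches is counting mismatched positions (Hamming distance). The sketch below is not BLASTN and not a Galaxy tool; it is a swapped-in illustration that ignores insertions and deletions, compares every library sequence against every unwanted sequence (so it will be slow on large libraries), and uses hypothetical names (`hamming`, `remove_near_matches`, `max_mismatches`) and made-up data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def remove_near_matches(library, unwanted_seqs, max_mismatches=1):
    """Drop records within `max_mismatches` of any unwanted sequence."""
    kept = []
    for header, seq in library:
        is_near = any(
            len(seq) == len(u) and hamming(seq, u) <= max_mismatches
            for u in unwanted_seqs
        )
        if not is_near:
            kept.append((header, seq))
    return kept

library = [("a", "ACGTACGT"), ("b", "ACGTACGA"), ("c", "TTTTTTTT")]
unwanted_seqs = ["ACGTACGT"]

tolerant = remove_near_matches(library, unwanted_seqs, max_mismatches=1)
exact = remove_near_matches(library, unwanted_seqs, max_mismatches=0)
print(tolerant)
print(exact)
```

With `max_mismatches=0` this reduces to the exact-match case; raising it removes progressively more distant sequences, so the threshold is a judgment call for the data at hand.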

Refs: Data Manipulation Olympics and Using dataset collections
