Removing all duplicated data from a tab delimited file

joanfitz · November 14, 2019, 11:50am

I am converting kgb ids to RSids using my bim files and the UCSC browser in usegalaxy.org. It works great; however, some SNPs have multiple matches, and I need to eliminate these duplicated values. I need to remove all of them rather than keep the first entry, as happens with --unique, because that first value may be incorrect. Is there an easy way to do this?

bernt-matthias · November 16, 2019, 9:52am

There is a tool, Sort, which has a flag to make the values (lines) unique. You may also use Sort followed by a separate tool called Unique.

joanfitz · November 16, 2019, 10:48am

Thanks for getting back to me. The unique command keeps the first of each set of duplicated values. I need to remove every one of them, as the first value may not be the correct one.

jennaj · November 19, 2019, 10:25pm

@joanfitz

Try using Group to count up the number of each distinct value, then Filter on that count number (only keep those with a count == 1).

If that drops the other columns from your file, add a Join step to link back to the original full data lines, then use Cut to put the columns back in the order you want (and to get rid of the count value). The Galaxy 101 tutorial has some similar steps if you are not sure how to use the tools serially.
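For anyone who wants to check the logic outside Galaxy, here is a minimal Python sketch of the same idea: count how often each key occurs, then keep only the rows whose key occurs exactly once. It is not what the Galaxy tools run internally. The function name `drop_all_duplicates`, the choice of the first column as the key, and the sample IDs are all made up for illustration.

```python
import csv
import io
from collections import Counter


def drop_all_duplicates(rows, key_col=0):
    """Return only the rows whose key_col value occurs exactly once.

    Unlike a "unique" step, no copy of a duplicated key is kept.
    """
    rows = list(rows)
    # Equivalent of the Group step: count occurrences of each key.
    counts = Counter(row[key_col] for row in rows)
    # Equivalent of the Filter step: keep rows with count == 1.
    return [row for row in rows if counts[row[key_col]] == 1]


# Hypothetical two-column, tab-delimited data (ID, RSid); id_A matches twice.
sample = "id_A\trs111\nid_B\trs222\nid_A\trs333\nid_C\trs444\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

for row in drop_all_duplicates(rows):
    print("\t".join(row))
```

Because the full rows are held in memory and filtered directly, no Join or Cut step is needed here; in Galaxy those steps are only required because grouping returns the key and its count rather than the whole line.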

Also, you could put all these into a mini-workflow for reuse if you plan on doing it again. How to make a workflow is also included in the 101.

There are several different text filtering tools. This one would work fine: “Filter data on any column using simple expressions (Galaxy Version 1.1.0)”

joanfitz · November 20, 2019, 10:31am

Thanks Jennifer, this is very helpful. All the best, Joan
