Removing all duplicated data from a tab delimited file

I am converting kgb IDs to rsIDs using my bim files and the UCSC browser in usegalaxy.org. It works great; however, some SNPs have multiple matches and I need to eliminate these duplicate values. I need to remove all of them, not keep the first entry as happens with --unique, because that first value may be incorrect. Is there an easy way to do this?


There is a tool, Sort, that has a flag to make the values (lines) unique. You may also use Sort followed by a separate tool called Unique.


Thanks for getting back to me. The unique command keeps the first of each set of duplicate values. I need to remove every copy of a duplicated value, because the first one may not be the correct one.


@joanfitz

Try using Group to count up the number of each distinct value, then Filter on that count number (only keep those with a count == 1).

If that breaks your file up, add in a Join step to link back to the original full data line, then rearrange the columns back the way you want them with Cut (plus get rid of the count value). The Galaxy 101 tutorial does some steps that are similar to this if you are not sure how to use the tools serially.
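If you ever want to check the result outside Galaxy, the same count-then-filter idea can be sketched in a few lines of Python. This is only a rough illustration, not what the Galaxy tools do internally: the function name `drop_all_duplicates`, the choice of column 0 as the key, and the example rows are all made up for the sketch.

```python
from collections import Counter


def drop_all_duplicates(lines, key_col=0, sep="\t"):
    """Keep only the lines whose key column value occurs exactly once.

    Unlike a "unique" step, no copy of a duplicated key survives.
    key_col is zero-based; both defaults are assumptions for this sketch.
    """
    # Split each non-empty line into columns.
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    # Equivalent of Group: count how often each key value appears.
    counts = Counter(row[key_col] for row in rows)
    # Equivalent of Filter on count == 1: drop every row of a repeated key.
    return [sep.join(row) for row in rows if counts[row[key_col]] == 1]


# Hypothetical tab-delimited rows: "snp_b" matches two different rsIDs.
example = [
    "snp_a\trs111",
    "snp_b\trs222",
    "snp_b\trs333",
    "snp_c\trs444",
]
result = drop_all_duplicates(example)
```

Here both `snp_b` rows are dropped, so only the `snp_a` and `snp_c` lines remain. Because the full line is kept alongside the count, no separate Join step is needed in this version.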

Also, you could put all these into a mini-workflow for reuse if you plan on doing it again. How to make a workflow is also included in the 101.

There are several different text filtering tools. This one would work fine: “Filter data on any column using simple expressions (Galaxy Version 1.1.0)”

Thanks Jennifer, this is very helpful. All the best, Joan
