Removing all duplicated data from a tab delimited file

joanfitz · November 14, 2019, 11:50am

I am converting kgb ids to RSids using my bim files and the UCSC browser in usegalaxy.org. It works great, but for some SNPs I get multiple matches and need to eliminate these duplicate values. I need to remove all of them, not keep the first entry as happens with --unique, because that first value may be incorrect. Is there an easy way to do this?

bernt-matthias · November 16, 2019, 9:52am

There is a tool, Sort, which has a flag to make the values (lines) unique. You may also use Sort followed by a separate tool called Unique.

joanfitz · November 16, 2019, 10:48am

Thanks for getting back to me. The unique command keeps the first of the duplicated values. I need to remove every entry for a duplicated value, as the first one may not be the correct one.

jennaj · November 19, 2019, 10:25pm

@joanfitz

Try using Group to count up the number of each distinct value, then Filter on that count number (only keep those with a count == 1).

If that breaks your file up, add in a Join step to link back to the original full data line, then rearrange the columns back the way you want them with Cut (which also gets rid of the count value). The Galaxy 101 tutorial does some steps that are similar to this if you are not sure how to chain the tools one after another.

Also, you could put all these into a mini-workflow for reuse if you plan on doing it again. How to make a workflow is also included in the 101.

There are several different text filtering tools. This one would work fine: “Filter data on any column using simple expressions (Galaxy Version 1.1.0)”
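For anyone who wants to check the result outside Galaxy, the same count-then-filter logic can be sketched in a few lines of Python. This is only an illustration under assumptions, not a Galaxy tool: the function name `drop_all_duplicated`, the choice of the first column as the key, and the example IDs are all made up.

```python
from collections import Counter


def drop_all_duplicated(lines, key_col=0, sep="\t"):
    """Keep only rows whose key value occurs exactly once.

    Hypothetical helper: key_col is the 0-based column holding the ID
    that may be duplicated (assumed here to be the first column).
    """
    # Split each non-empty line into columns.
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    # Equivalent of the Group step: count occurrences of each key.
    counts = Counter(row[key_col] for row in rows)
    # Equivalent of Filter on count == 1: every copy of a repeated key is dropped.
    return [sep.join(row) for row in rows if counts[row[key_col]] == 1]


# Made-up example data: id_B matches two different values, so both rows go.
example = [
    "id_A\tvalue_1",
    "id_B\tvalue_2",
    "id_B\tvalue_3",
    "id_C\tvalue_4",
]
result = drop_all_duplicated(example)
print("\n".join(result))
```

Because the full line is kept alongside the key, no separate Join or Cut step is needed here; in Galaxy those steps exist only because Group returns the key and count without the rest of the line.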

joanfitz · November 20, 2019, 10:31am

Thanks Jennifer, this is very helpful. All the best, Joan
