Filter tabular data columns by arbitrary list

swbioinf · March 7, 2024, 6:15am

Hello, I’m trying to figure out the Galaxy way of subsetting columns by an arbitrary list of column names.

E.g. given a tabular file with irregular column names like so:

gene	A	B	sampleC	D
geneA	1	2	2	2
geneB	0	0	0	4

How could I pull out columns, assuming I have their names in a single-column text file like so (or as a comma-separated string)?

gene
A
D

Yielding

gene	A	D
geneA	1	2
geneB	0	4

The issue is that the list of columns to keep can change and can be of arbitrary length, so it can’t be hardcoded as, say, columns 1, 2 and 5. This is to be part of a workflow.
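To pin down the operation being asked for, here is a minimal sketch in plain Python (outside Galaxy, standard library only). The function name `select_columns` is hypothetical; the sample table and keep list are the ones from this post. Column names are resolved to positions at run time, so no column index is hardcoded.

```python
import csv
import io


def select_columns(table_text, wanted):
    """Keep only the columns named in `wanted`, in the order given."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter="\t"))
    header = rows[0]
    # Look up each wanted name in the header; raises ValueError if one is missing.
    indices = [header.index(name) for name in wanted]
    return [[row[i] for i in indices] for row in rows]


# Sample data from the post (five tab-separated columns).
table = "gene\tA\tB\tsampleC\tD\ngeneA\t1\t2\t2\t2\ngeneB\t0\t0\t0\t4\n"
keep = ["gene", "A", "D"]

result = select_columns(table, keep)
for row in result:
    print("\t".join(row))
```

Because the positions come from the header on each run, the keep list can change or grow without touching the code.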

I can’t find a tool that does this directly - perhaps I’ve missed it? (please do tell me I’ve missed it :))

My thinking is I could do it something like the following:

1. Melt into long format with the ‘Table Compute’ (melt) tool.
2. Do an inner join of the long format with the desired column list using ‘Join two Datasets side by side on a specified field’.
3. ‘Pivot’ the filtered table wider with ‘Table Compute’ (pivot).
4. But how do I then put the columns back in a certain order?
   a. Maybe in this case I could use ‘column arrange’ to get ‘gene’ up front if I don’t particularly care about the rest.
   b. Is there a general solution, e.g. if I just wanted to match the order of my columns-to-keep?
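The melt, join, pivot idea above can be sketched in plain Python as follows. This only illustrates the logic and is not how the Galaxy tools are implemented; the helper names `melt` and `pivot` are hypothetical, and `gene` is assumed to be the identifier column. Laying the pivoted columns out in keep-list order is one possible answer to the ordering question.

```python
def melt(header, rows, id_col="gene"):
    """Wide -> long: one (id, column name, value) record per data cell."""
    id_idx = header.index(id_col)
    return [
        (row[id_idx], name, value)
        for row in rows
        for name, value in zip(header, row)
        if name != id_col
    ]


def pivot(long_rows, column_order, id_col="gene"):
    """Long -> wide, with value columns laid out in `column_order`."""
    cells = {}
    ids = []  # remembers the order in which row ids first appear
    for row_id, name, value in long_rows:
        if row_id not in cells:
            cells[row_id] = {}
            ids.append(row_id)
        cells[row_id][name] = value
    wide = [[id_col] + list(column_order)]
    for row_id in ids:
        wide.append([row_id] + [cells[row_id][name] for name in column_order])
    return wide


header = ["gene", "A", "B", "sampleC", "D"]
rows = [["geneA", "1", "2", "2", "2"], ["geneB", "0", "0", "0", "4"]]
keep = ["gene", "A", "D"]

# The id column is carried through the melt, so only the value columns are joined on.
value_cols = [name for name in keep if name != "gene"]
wanted = set(value_cols)

long_rows = melt(header, rows)
# Inner join against the keep list: drop records whose column is not wanted.
filtered = [rec for rec in long_rows if rec[1] in wanted]
wide = pivot(filtered, value_cols)

for row in wide:
    print("\t".join(row))
```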

But that seems somewhat convoluted, so I think I’m missing something obvious. Can anyone point me in the right direction, please?

Thanks,
Sarah.

igor · March 8, 2024, 5:44am

Hi Sarah,
What about transpose > join > transpose?
Kind regards,
Igor
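As a rough illustration of the transpose > join > transpose idea, here is a plain Python sketch using the sample table from the first post. It mirrors the three steps only conceptually and makes no claim about how the Galaxy transpose or join tools behave; in particular, whether a join tool keeps the keep-list order is not assumed here.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(col) for col in zip(*rows)]


table = [
    ["gene", "A", "B", "sampleC", "D"],
    ["geneA", "1", "2", "2", "2"],
    ["geneB", "0", "0", "0", "4"],
]
keep = ["gene", "A", "D"]

# Step 1: transpose, so each original column becomes a row keyed by its name.
by_name = {row[0]: row for row in transpose(table)}

# Step 2: join against the keep list. Iterating over `keep` fixes the output order
# in this sketch; names absent from the table are silently skipped.
joined = [by_name[name] for name in keep if name in by_name]

# Step 3: transpose back to the original orientation.
result = transpose(joined)

for row in result:
    print("\t".join(row))
```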

swbioinf · March 10, 2024, 10:43pm

Thanks Igor - Yes - transpose should do the trick!

And I’ll use column arrange to get ‘gene’ back up front.

(Side thought: wouldn’t a Galaxy table manipulation cheat sheet be nice, à la https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf or https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf ?)

jennaj · March 12, 2024, 9:43pm

Hi @swbioinf – jumping in – that would be a fantastic addition to the “Data Olympics” series we already have going at the GTN.

See here for what we have so far. You could template off any. → GTN Materials Search

And here for how to contribute → Contributing to the Galaxy Training Material / Tutorial List. The GTN people would help (most are also community volunteers), and you’d get full attribution with stable public resource links!

swbioinf · March 14, 2024, 1:54am

Thanks jennaj - The cheatsheet part of that tutorial is exactly the sort of thing I’m looking for.

I’ll probably do a few things along this line (trying to automate some manual manipulation rubbish), so will keep notes.
