collection operations column join, result has missing datasets and unnamed columns

Hey community,

I have a collection with a large number of datasets and I need to combine into a single file .
The datasets in the collection have 2 columns (c1=count data; c2=key) and are labeled with sample names.
The goal is to combine them to have 1 file with columns = sample names (names of individual data in og set) and rows = count info for each key.
Most, but not all, of the keys in c2 of original file are shared between datasets.

When I run the tool “column join” from the collection operations I have some issues with the resulting dataset and I’m not sure what’s going on.
The issues are:

  1. Not all of the samples from the first dataset are included in the join dataset - some are missing (>100).
  2. There are columns in the join dataset that do not have a sample name, but have count data. (note: the number of unlabeled columns does not equal the number of samples from the collection that are missing).

I’m not really sure what’s going on and any insights would be greatly appreciated!!
Thank you!

1 Like

Hi @SarahCraig

This is likely where the problem is. Only data with matching keys will be output. The other columns are optional and can be filled with replacement text specified on the tool form if needed, to preserve the structure of the matrix.

Important items to pay attention to:

  1. Keys should be unique within each input dataset
  2. The order of keys should be the same between all input datasets Not required
  3. Keys should be all “one_word” and not include any spaces
  4. All inputs should be in the same list collection
  5. List identifiers (eg: dataset names, presumably these are the sample names?) should be unique within that list collection
  6. List identifiers should also be “one_word” with no spaces
  7. If any datasets within that input list collection contain a header, all datasets should include a similar header (same number of lines), and the option “Number of header lines in each input file” is set to match the inputs.
  8. The input collection needs to be a flat list of datasets (not nested) with the datatype “tabular”.
  9. If you think there might be missing values, make sure the replacement text doesn’t include any spaces. You might want to specify this anyway – add some text that is unique so you can search for it when double-checking the output.

If all datasets were produced by the same protocol/workflow, all of the above is probably true already. If you merged together datasets from different protocols, you’ll need to check and make changes as needed.

Dataset collection tutorials: Search Tutorials

Let’s start there.

1 Like

Of the items you listed the only one that may be not true is #2 - the order of the keys is the same in each dataset. All of the other important items are met. Is there a way to change the order of the list for each dataset - I have >1000 datasets so I really don’t want to have to do it manually to each.

I respectfully disagree that having keys not shared between the datasets is the problem. Looking at the “using dataset collections” tutorial, in the section “collection operations” and subsection “tools that combine elements of a collection” the first entry is the column join tool.- The image there shows exactly what I am trying to do and the “keys” (the numbers in this case) are not shared between all the datasets. The tool puts a filler (in this case a dot) in the column of the datasets that have no value for that key.

Thanks for your help, it’s much appreciated!!!

2 Likes

Hi,

The question was pretty broad so I was suggesting things I would try if the same tool wasn’t producing expected results, eg: reduce the inputs to the simplest format. Just tested a set of combinations and it worked without having the keys in the same order or all keys in each dataset. Those two tests are here if interested: Galaxy | Accessible History | test coll join

Maybe check to see if any of your datasets have content like this? Any might cause problems.

CountValue - Tab - EmptyValue - EndOfLine

or

CountValue - EndOfLine

or

EmptyValue - Tab - EmptyValue - EndOfLine

or even just a blank line like:

StartOfLine - EndOfLine

Those cause problems with a few other tools and the solution is usually to remove blank lines or lines that do not contain content for the primary key. The tool Select is good at removing specific lines with regular expressions to do some data cleaning, and can be used in batch with a collection input in plain text formats (tabular, txt, fasta, fastq, bed, etc).

Examples of Select with the “NOT Matching” option set to “yes” and regular expressions. I usually set the option to “no” to see what is getting removed first (if anything), then run on data to clean it up. The original data isn’t changed so this can be run repeatedly until correct as a way of validating input content.

Remove empty lines:

^$

Remove lines with numerical data in the first column but nothing in the second. The first traps a tab, the second doesn’t.

^\d\d*\t$
^\d\d*$

Remove (or become aware of) lines that contain more than two columns of tab separated data:

^.+\t.+\t.+$

You could also do something like using the tool Cut to extract the second column from each dataset, then Collapse Collection into single dataset in order of the collection to put all in one dataset, then run the tool Group to count up the number of unique values. That could also reveal content or format issues.

Hope that helps!

Thanks for the suggestions! I went through the checks and have verified the data formatting.

I tried using the column join tool again and have all the samples there now (the number of columns matches the number of samples), but am still missing names for some of the columns (i.e. there are counts in the column but no sample name).

1 Like

Hi @SarahCraig

I’m sending you a direct message – we’d like to review the run to see what is going on. Thanks!