I’m currently trying to run a sort on some bed files. The instructions for the sort are:
sort -k1,1 -k2,2n -k3,3n
However, I’m having difficulty translating this to the sort tool in Galaxy. Do k1, k2, and k3 refer to the columns? If so, what do 1, 2n, and 3n refer to? Any help would be appreciated.
-k1,1
This means to start sorting at the first column and stop sorting this way at the first column (otherwise the sort key would run from whatever the “start” column was until the end of the line). Limiting the key to the first column is important because the “sorting type” to apply differs by column. In this case, the sorting is “alphabetical”. But see the tool form: alphabetical order for any letters in the first column (a before b) will be what humans expect, but the order of any numbers after them may not be. Example: “11” comes before “2” because the first character of “11” (1) is smaller than the first character of “2” (2).
-k2,2n
Sort the second column with a “numerical” sort (“n”) – that means smallest to largest number. For this case, “2” would be smaller than “11” and listed first.
-k3,3n
Sort the third column, same rules as the second column (numerical aka “n”).
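If you have a command line handy, a minimal sketch like the one below shows the difference between the two sort types on the same pair of values. The `LC_ALL=C` setting is my own addition to pin the text comparison to raw characters; it is not part of the original instructions.

```shell
# Sort the same two values as text, then as numbers.
# LC_ALL=C (an assumption here) makes the text sort compare raw characters.
text_sorted=$(printf '2\n11\n' | LC_ALL=C sort -k1,1 | paste -sd' ' -)
num_sorted=$(printf '2\n11\n' | LC_ALL=C sort -k1,1n | paste -sd' ' -)

# paste -s joins the sorted lines into one space-separated line for display.
echo "alphabetical: $text_sorted"
echo "numerical:    $num_sorted"
```

The alphabetical run lists “11” first and the numerical run lists “2” first, which is why the start and end columns get the “n” flag.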
In practical terms, these methods will “coordinate sort” the first three columns of a bed file. Most bioinformatics tools are designed to interpret chromosome positions this way for bed datasets. The rest of the line isn’t considered when sorting. It is just passed through the tool, still attached to the “chrom-start-stop” data it was originally on the same line with.
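As a rough illustration of the coordinate sort, here is the same command applied to a few made-up, tab-separated bed lines. The chromosome names, positions, and feature names are hypothetical, and `LC_ALL=C` is again an assumption added to keep the alphabetical comparison predictable.

```shell
# Hypothetical 4-column bed lines: chrom, start, end, name (tab-separated).
bed=$(printf 'chr2\t500\t600\tfeatC\nchr10\t100\t200\tfeatA\nchr2\t90\t300\tfeatB\nchr2\t90\t150\tfeatD\n')

# Column 1 as text, then columns 2 and 3 as numbers.
sorted=$(printf '%s\n' "$bed" | LC_ALL=C sort -k1,1 -k2,2n -k3,3n)

printf '%s\n' "$sorted"
```

With this input, chr10 sorts before chr2 (alphabetical on the first column), start 90 sorts before start 500, and among the two lines starting at 90 the one ending at 150 comes first. The fourth column plays no part in the ordering and simply travels with its line.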
Try a few different sort conditions on your data and you’ll notice the difference. The tool form also has a few examples that should help to clear up why the command line was written that way.
Thanks for the detailed reply. That makes it a lot clearer for me. I’ve tried running it and it looks like it’s working. I had been contemplating downloading the files, running them through cut & sort on my conda build, and re-uploading to Galaxy. Glad I won’t have to.