Compute distance matrix

When I compute a distance matrix for gene orders, the results appear to be inverted compared with the old version of CREx: a smaller number now means more similar, a larger number means more dissimilar, and the diagonal is all zeros. I see this with all of the distance measures. In the old version of CREx it was the opposite: the higher the value, the more similar the genomes, and the diagonal was 1326. I checked this by rerunning old data that I had previously analyzed with the old version of CREx, and I get the same result, so I do not believe it is caused by my new dataset. I bring this up because it affects how I interpret the results when I go to publish any work.

Thanks for any help

Hi @LucasJennings

The tool author visits this forum, so let’s see how they suggest interpreting the data outputs. Ping @bernt-matthias

Meanwhile, you could share some examples. Which prior tool version? Do you have a simple comparison example that shows the difference? How to share → How to get faster help with your question

Recent Q&A about usage, in case that might be related → CREx: "missing header" error

Yes, I can share all of that. I am posting a document showing the output I got yesterday using the compute distance matrix tool on Galaxy and a matrix that was made using the old CREx webserver. The new matrix is at the top and the old one is at the bottom of this document.

Hi @LucasJennings

Are you able to compare the command lines of the two runs, to see whether different options or methods were used?

The Galaxy command-line is under the “i” icon – scroll down on that page to find it.

Maybe you can also find this in the other web application?

Other than that, maybe a difference between versions of the tool itself is the root cause. That said, your differences seem more like a parameter change.

We can ping one of the tool authors again who would know much more of course! Hi @bernt-matthias would you be able to suggest what to try next? Thanks!

Dear @LucasJennings,

thanks for bringing this up. You are completely right. The old website had three measures that the user could choose from:

  • number of common intervals (a similarity measure)
  • number of breakpoints (a distance measure)
  • reversal distance (a distance measure)

For the coloring of the matrix we transformed the distance measures into a similarity measure (by n - dist, where n is the number of genes) – but this point is not relevant for your case, since you used the number of common intervals.

When I had to resurrect the functionality of the website, I resorted to the program distmat, which already had the same functionality, with the important difference that a common interval distance is computed, i.e. (n * (n-1) + 1) - X for linear genomes and ((n-1) * n + 1) - X for circular genomes, where n is the number of genes and X the number of common intervals. Here the first part of the difference is the maximum number of common intervals (which is reached when comparing equal genomes, i.e. the value on the diagonal in your original tables).
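As a minimal sketch of the relationship described above, a common interval distance can be turned back into a number of common intervals by subtracting it from the maximum, i.e. the diagonal value of an old-style table. The function name and the example distances are made up for illustration; only the value 1326 comes from this thread (the diagonal reported for the old CREx output).

```python
def distances_to_common_intervals(dist_matrix, max_common_intervals):
    """Convert common interval distances back into common interval counts.

    Assumes every distance was computed as max_common_intervals - X,
    where X is the number of common intervals (as described in this thread).
    """
    return [[max_common_intervals - d for d in row] for row in dist_matrix]


# Made-up distances for three genomes (illustration only).
distances = [
    [0, 120, 300],
    [120, 0, 250],
    [300, 250, 0],
]

# 1326 is the diagonal value reported for the old CREx matrix.
similarities = distances_to_common_intervals(distances, 1326)
for row in similarities:
    print(row)
```

With this conversion the diagonal becomes the maximum again and larger values mean more similar gene orders, which matches the orientation of the old tables.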

In my opinion, the advantage of this change is that the output consistently contains distance measures.

The second reason that made me do this change is that I added the CREx distance, i.e. the number of rearrangements computed by CREx. It was a huge oversight that the old CREx website used all sorts of distance measures to compute that matrix (that was only intended to select the actual pairwise CREx comparisons to be shown) but not the CREx distance. As I was thinking about the change (to compute the common interval distance instead of the number of common intervals) I now faced the problem that the CREx distance does not have a maximum that I can easily compute (probably it’s n, but I would need a formal proof) – so I could not make this a similarity measure.

So long story short: in the end I just made everything distance measures. And I certainly should have documented this better.

What I could implement is a boolean flag that would turn the breakpoint, inversion and reversal distances into similarity measures (and result in an error for CREx distances). What do you think? Would this be helpful?
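To make the proposal a bit more concrete, here is a rough sketch of what such a flag could do. It is not distmat code: `to_similarity` and the measure names are hypothetical, and it only assumes the n - dist transformation mentioned above for the breakpoint and reversal distances.

```python
def to_similarity(dist_matrix, measure, n_genes):
    """Hypothetical helper: turn a distance matrix into a similarity matrix.

    Uses n - dist (n = number of genes) for the breakpoint and reversal
    distances and refuses the CREx distance, for which no maximum is known.
    """
    if measure == "crex":
        raise ValueError("CREx distance has no known maximum; cannot convert")
    if measure not in ("breakpoints", "reversal"):
        raise ValueError(f"unsupported measure: {measure}")
    return [[n_genes - d for d in row] for row in dist_matrix]


# Made-up breakpoint distances for two genomes with 37 genes (illustration only).
print(to_similarity([[0, 5], [5, 0]], "breakpoints", 37))
```

Raising an error for the CREx distance, rather than guessing a maximum, keeps the output from silently mixing measures with different meanings.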

Cheers,
Matthias

@bernt-matthias I think it would be good to have continuity with what was used in previous literature. Most of the literature I see uses the number of common intervals, where a higher number means more similar because the genomes share more common intervals. Would this be possible to implement? If not, I will just use the similarity computed by the common intervals on Galaxy.

I opened an issue and hope to find time soon to fix it: distmat distance / similarity measure (#4) · Issues · Matthias Bernt / revoluzer · GitLab

@bernt-matthias thank you very much!