Select unique x and y pairs to minimize a value

Views: 91 | Posted: 2020/10/15 20:59:50 | Tags: r, dplyr, data.table


Problem description

I need to select a unique ID.x for each ID.y (forming unique pairs) so that the distance is minimized, assigning pairs starting from the lowest distance values. It feels a bit like a sudoku puzzle: each x and y can be used only once, so settling one pair constrains which pairs remain available for the others.

In the example below, ID.x 55 is a better match for ID.y 1 than ID.x 56 is, because ID.x 56 is a better match for ID.y 2. Similarly, ID.x 58 can be matched to ID.y 4, because every other available option has a greater distance, and ID.y 5 can then take ID.x 59 at distance 4. However, ID.y 7 cannot be matched, because ID.x 61 and ID.x 62 are equally close.

Example:

library(data.table)

DT = data.table(
  ID.x = c("55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62"),
  ID.y = c("1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7"),
  distance = c("2", "3", "3", "4", "6", "6", "7", "2", "1", "2", "5", "5", "5", "6", "4", "4", "3", "5", "5", "5", "6", "5", "5", "5", "4", "4", "5", "6", "7", "7", "7", "6", "4", "6", "7", "6", "6", "6", "6", "4", "2", "5", "7", "7", "7", "7", "5", "5", "5", "6", "6", "6", "6", "4", "4", "5")
  )

Goal:

   ID.x ID.y distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   NA    7        NA

My first attempt, inspired by this question, does not work:

DT[DT[, .I[distance == min(distance)], by=ID.x]$V1][DT[, .I[1], by = ID.y]$V1]
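
One way to see why a per-group minimum is not enough is to list, for each ID.y, the smallest distance and every ID.x that reaches it. The following is a minimal diagnostic sketch, not a solution. It rebuilds the first example compactly with rep(), stores distance as numbers rather than strings (an assumption, so that min() compares numerically), and uses the made-up column names min_distance, n_ties and candidates.

```r
library(data.table)

# Same values as the first example, rebuilt compactly.
# Assumption: distance is numeric here instead of character.
DT <- data.table(
  ID.x = rep(as.character(55:62), each = 7),
  ID.y = rep(as.character(1:7), times = 8),
  distance = c(2, 3, 3, 4, 6, 6, 7,
               2, 1, 2, 5, 5, 5, 6,
               4, 4, 3, 5, 5, 5, 6,
               5, 5, 5, 4, 4, 5, 6,
               7, 7, 7, 6, 4, 6, 7,
               6, 6, 6, 6, 4, 2, 5,
               7, 7, 7, 7, 5, 5, 5,
               6, 6, 6, 6, 4, 4, 5)
)

# For each ID.y: the smallest distance, how many ID.x reach it,
# and which ID.x those are (illustrative column names).
best <- DT[, .(
  min_distance = min(distance),
  n_ties       = sum(distance == min(distance)),
  candidates   = paste(ID.x[distance == min(distance)], collapse = ",")
), by = ID.y]

print(best)
```

On this data, ID.x 56 is the sole best candidate for both ID.y 2 and ID.y 3, and several ID.y values are tied at their minimum, so the pairs cannot be chosen independently per group.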

UPDATE: In response to the answers by @chinsoon12 and @paweł-chabros, here is an updated data.table that fixes a few things. It swaps x and y (my original question matched x's with y's, but the more natural interpretation is y with x). In the labels of the updated table, this example removes the ambiguous match for ID.x 7 (ID.y 7 in the first example): its lowest available distance now matches ID.y 63. Separately, I added a new ID.x 8 to show a case where no unambiguous match is possible (it matches ID.y 64 and 65 equally well). The answer should not select a match arbitrarily.

DT = data.table(
  ID.y = c("55", "55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62", "62", "63", "63", "63", "63", "63", "63", "63", "63", "64", "64", "64", "64", "64", "64", "64", "64", "65", "65", "65", "65", "65", "65", "65", "65"),
  ID.x = c("1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8"),
  distance = c(2, 3, 3, 4, 6, 6, 7, 15, 2, 1, 2, 5, 5, 5, 6, 15, 4, 4, 3, 5, 5, 5, 6, 15, 5, 5, 5, 4, 4, 5, 6, 15, 7, 7, 7, 6, 4, 6, 7, 15, 6, 6, 6, 6, 4, 2, 5, 15, 7, 7, 7, 7, 5, 5, 6, 15, 6, 6, 6, 6, 4, 4, 10, 15, 11, 11, 11, 11, 11, 11, 5, 12, 11, 11, 11, 11, 11, 11, 11, 1, 11, 11, 11, 11, 11, 11, 11, 1)
  )

Expected result:

   ID.y ID.x distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   63    7        5
8:   NA    8        NA
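
As a rough illustration of the "lowest distance first, each ID used once" idea, here is a hedged sketch of a plain greedy pass over the updated data. It is not taken from any of the answers and is not claimed to be the intended method. The helper name greedy_match and the ambiguous column are hypothetical. Ties are resolved by row order, and a pick is only flagged (not dropped) when the same ID.x still has another available ID.y at the same distance.

```r
library(data.table)

# Same values as the updated example, rebuilt compactly.
DT <- data.table(
  ID.y = rep(as.character(55:65), each = 8),
  ID.x = rep(as.character(1:8), times = 11),
  distance = c( 2,  3,  3,  4,  6,  6,  7, 15,
                2,  1,  2,  5,  5,  5,  6, 15,
                4,  4,  3,  5,  5,  5,  6, 15,
                5,  5,  5,  4,  4,  5,  6, 15,
                7,  7,  7,  6,  4,  6,  7, 15,
                6,  6,  6,  6,  4,  2,  5, 15,
                7,  7,  7,  7,  5,  5,  6, 15,
                6,  6,  6,  6,  4,  4, 10, 15,
               11, 11, 11, 11, 11, 11,  5, 12,
               11, 11, 11, 11, 11, 11, 11,  1,
               11, 11, 11, 11, 11, 11, 11,  1)
)

# Hypothetical helper: greedy one-to-one matching, lowest distance first.
greedy_match <- function(dt) {
  pool  <- dt[order(distance)]  # stable sort: ties keep their input order
  picks <- list()
  while (nrow(pool) > 0L) {
    best <- pool[1L]
    # Count the still-available ID.y at the same distance for this ID.x.
    n_equal <- pool[ID.x == best$ID.x & distance == best$distance, .N]
    picks[[length(picks) + 1L]] <-
      best[, .(ID.y, ID.x, distance, ambiguous = n_equal > 1L)]
    # Each ID.x and each ID.y may be used only once.
    pool <- pool[ID.x != best$ID.x & ID.y != best$ID.y]
  }
  rbindlist(picks)
}

matches <- greedy_match(DT)
print(matches[order(as.integer(ID.x))])
```

Rows with ambiguous set to TRUE could then be replaced by NA. On this data the flag is raised for ID.x 8 (ID.y 64 and 65 are both at distance 1) and also for ID.x 5, because ID.y 59 and 62 are both at distance 4 once ID.y 58 has been taken. Whether that second tie should count as ambiguous depends on how strictly the "no arbitrary choice" rule is read.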

I'm using this code to complete a fuzzy join using stringdist_join, as described in this question. I have two datasets that need matching (hence ID.x and ID.y). In my case, I have pre-test and post-test scores that need to be matched on multiple unreliable characteristics.
