Join in data.table with multiple matches
Question
I posted a question earlier about joining columns in a data.table, where one column (dep) holds the dependency information for an entry. For example, entry 3 depends on a record with label 40. The mark column is then assigned the id of the record whose label the entry depends on. That question is posted here: Comparing columns uptill certain index in R
library(data.table)
trace <- data.table(id=1:10, dep=c(-1,45,40,47,0,45,43,42,45,45),
label=c(99,40,43,45,47,42,48,45,52,67), mark=rep("",10))
id dep label mark
1: 1 -1 99
2: 2 45 40
3: 3 40 43
4: 4 47 45
5: 5 0 47
6: 6 45 42
7: 7 43 48
8: 8 42 45
9: 9 45 52
10: 10 45 67
would result in:
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
8: 8 42 45 6
9: 9 45 52 8
10: 10 45 67 8
The following solution worked for me:
trace[, mark := trace[.(dep = dep, id = id), on=.(label = dep, id < id), mult="last", x.id]]
# if not found, use current id
trace[is.na(mark), mark := id ]
In the above case, duplicates are resolved by taking the most recent match. However, if I want to keep all matches instead of only the last one, is there a way to get output like this (where the last and second-to-last entries have multiple dependencies):
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
8: 8 42 45 6
9: 9 45 52 4,8
10: 10 45 67 4,8
I am not particularly concerned about the format in which these dependencies are recorded. A slight modification of the earlier solution using mult="all",
trace[, mark := trace[.(dep = dep, id = id), on=.(label = dep, id < id), mult="all", toString(x.id)]]
results in:
id dep label mark
1: 1 -1 99 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
2: 2 45 40 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
3: 3 40 43 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
4: 4 47 45 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
5: 5 0 47 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
6: 6 45 42 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
7: 7 43 48 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
8: 8 42 45 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
9: 9 45 52 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
10: 10 45 67 NA, NA, 2, NA, NA, 4, 3, 6, 4, 8, 4, 8
Recommended answer
OK, with a slight modification:
trace[, mark := trace[.(dep = dep, id = id), on=.(label = dep, id < id),
if (all(is.na(x.id))) NA_character_ else toString(x.id), by=.EACHI]$V1 ]
# if not found, use current id
trace[is.na(mark), mark := as.character(id) ]
It uses as.character(id) because mark is now a string variable.
To see how by=.EACHI works, try running this part on its own:
trace[.(dep = dep, id = id), on=.(label = dep, id < id),
if (all(is.na(x.id))) NA_character_ else toString(x.id), by=.EACHI]
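As a rough illustration of why by=.EACHI matters here, the sketch below contrasts the two calls. The names all_at_once and per_row are made up for this example, and trace is rebuilt without the mark column to keep the block self-contained. Without by, j is evaluated once over every joined row, so toString() returns a single string that is then recycled into every row of mark. With by=.EACHI, j is evaluated once per row of i.

```r
library(data.table)

# Same data as in the question, minus the mark column (not needed here).
trace <- data.table(id    = 1:10,
                    dep   = c(-1, 45, 40, 47, 0, 45, 43, 42, 45, 45),
                    label = c(99, 40, 43, 45, 47, 42, 48, 45, 52, 67))

# No by: j sees all joined rows at once, so toString() yields one string.
all_at_once <- trace[.(dep = dep, id = id), on = .(label = dep, id < id),
                     toString(x.id)]
length(all_at_once)

# by = .EACHI: j runs once for each row of i, so there is one result per id.
# n_matches is a hypothetical summary column counting non-missing matches.
per_row <- trace[.(dep = dep, id = id), on = .(label = dep, id < id),
                 .(n_matches = sum(!is.na(x.id))), by = .EACHI]
per_row
```

Rows of i with no earlier matching label still form a group of their own, with x.id equal to NA, which is why the answer checks all(is.na(x.id)).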
Comments. I expect this will not scale well to larger tables. Also, the column no longer matches id's type, so it cannot be used for merging, etc. A list-class column would have the same problem:
trace[, mark := trace[.(dep = dep, id = id), on=.(label = dep, id < id),
list(list(na.omit(x.id))), by=.EACHI]$V1 ]
# if not found, use current id
trace[lengths(mark) == 0L, mark := as.list(id)]
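If keeping mark usable for merging matters, one possible alternative (an assumption on my part, not something the question asked for) is to store the dependencies in long form, with one row per (id, dependency) pair. The table name deps is hypothetical, and trace is rebuilt without mark so the sketch runs on its own. Because mark stays an integer column here, it has the same type as id.

```r
library(data.table)

# Same data as in the question, minus the mark column.
trace <- data.table(id    = 1:10,
                    dep   = c(-1, 45, 40, 47, 0, 45, 43, 42, 45, 45),
                    label = c(99, 40, 43, 45, 47, 42, 48, 45, 52, 67))

# One row per match: i.id is the dependent entry, x.id the earlier record
# whose label equals that entry's dep. Unmatched entries get a single NA row.
deps <- trace[.(dep = dep, id = id), on = .(label = dep, id < id),
              .(id = i.id, mark = x.id)]

# if not found, use current id (same fallback as above)
deps[is.na(mark), mark := id]

# Collapse back to one string per id only when a display format is needed.
collapsed <- deps[, .(mark = toString(mark)), by = id]
```

The trade-off is that deps has more rows than trace, so it has to be joined back on id whenever the other columns are needed.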