r按行比较两个data.tables [英] r compare two data.tables by row
问题描述
我有两个要比较的数据表。
I have two data.tables that I want to compare.
但是不知道为什么会有警告
But don't know why there is a warning
DT1 <- data.table(ID=c("F","A","E","B","C","D","C"),
num=c(59,3,108,11,22,54,241),
value=c(90,47,189,38,42,86,280),
Mark=c("Mary","Tom","Abner","Norman","Joanne",
"Bonnie","Trista"))
DT2 <- data.table(Mark=c("Mary","Abner","Bonnie","Trista","Norman"),
numA=c(48,20,88,237,20),
numB=c(60,326,54,268,89),
valueA=c(78,34,78,270,59),
valueB=c(90,190,90,385,75))
DToutput <- DT1[(num > DT2$numA & num < DT2$numB &
value > DT2$valueA & value < DT2$valueB)]
我的目标:
我想找到 num
和 value
基于 Mark $ c
DT1
中的$ c>,范围为 numA
和 numB $
DT2
中的c $ c>。
I want to find num
and value
based on Mark
in DT1
, and there is a range of numA
and numB
in DT2
.
例如:
对于 DT1
中的行 F
, num = 59
和 value = 90
,并且 Mark = Mary
。因此,当使用 by = Mary
时,还必须匹配:
For row F
in DT1
, num = 59
and value = 90
, and Mark = "Mary"
. So, when using by=Mary
, you must also match:
num(59 )> DT2 $ numA(48)& num(59)< DT2 $ numB(60)&值(90) DT2 $ valueA(78)&值(90)< DT2 $ valueB(90)
您可以看到 90< 90
不成立,因此我的结果将没有行 F
You can see that 90 < 90
does not hold, so my result will not have row F
此警告:
Warning messages:
1: In num > DT2$numA : longer object length is not a multiple of shorter object lengt
2: In num < DT2$numB : longer object length is not a multiple of shorter object lengt
3: In value > DT2$valueA : longer object length is not a multiple of shorter object lengt
4: In value < DT2$valueB : longer object length is not a multiple of shorter object lengt
如何将其修改为完成我想做的事?
How can I modify it to complete what I want to do?
谢谢
添加:DT2中可以使用多个相同的标记,但是值不在同一范围内。这会影响比较吗?
Added: Multiple identical Marks may be used in DT2, but the values are not the same range. Does this affect the comparison?
推荐答案
这通常是您要查找的内容吗?我加入了数据表,并使用之间
进行了过滤。如果这不是您要查找的内容,是否可以发布预期输出的数据表?
Is this generally what you are looking for? I joined the datatables and filtered using between
for your conditions. If this is not what you are looking for, can you post a datatable of your expected output?
library(data.table)
DT1[DT2, on = "Mark"][between(num, numA, numB, incbounds = F) & between(value, valueA, valueB, incbounds = F)]
ID num value Mark numA numB valueA valueB
1: E 108 189 Abner 20 326 34 190
2: C 241 280 Trista 237 268 270 385
编辑:
基准此方法与@ Chinsoon12中的非等值内部联接之间的比较表明,即使有更多数据,非等值内部联接也要快得多。这不是一个完美的基准(只是重复 data.table
),但是我仍然认为很明显,非等价内联接的效率要高得多。
EDIT:
Benchmark comparison between this approach and the non-equi inner-join from @Chinsoon12 shows that the non-equi inner-join is much faster with even a little more data. It is not a perfect benchmark (just repeated the data.table
), but I still think it is clear that the non-equi inner-join is much more efficient.
Unit: milliseconds
expr min lq mean median uq max neval
between 233.6378 265.4323 303.14039 301.82455 334.3225 373.2760 10
non_equi_inner 71.6925 74.1547 96.96584 91.14375 97.6664 179.9907 10
基准代码:
DT1 <- data.table(sapply(DT1, rep, 1e3))[, c("num", "value") := lapply(.SD, as.integer), .SDcols = c("num", "value")]
DT2 <- data.table(sapply(DT2, rep, 1e3))[, c("numA", "numB", "valueA", "valueB") := lapply(.SD, as.integer), .SDcols = c("numA", "numB", "valueA", "valueB")]
microbenchmark::microbenchmark(
between = {
DT1[DT2, on = "Mark", allow.cartesian = T][between(num, numA, numB, incbounds = F) & between(value, valueA, valueB, incbounds = F)]
},
non_equi_inner = {
DT1[DT2, on=.(Mark, num>numA, num<numB, value>valueA, value<valueB), nomatch=0L,
.(ID, num=x.num, value=x.value, Mark), allow.cartesian = T]
},
times = 10
)
这篇关于r按行比较两个data.tables的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!