Conditional data.table merge with .EACHI

Question

I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.

## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")



Create two fake datasets

Create a "big" table with 50 rows (10 values for each ID).

library(data.table)
set.seed(1L)

# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]

    ID ValueBig Rank
 1:  A      266    3
 2:  A      373    4
 3:  A      573    5
 4:  A      909    9
 5:  A      202    2
---                 
46:  E      790    9
47:  E       24    1
48:  E      478    2
49:  E      733    7
50:  E      693    6

Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)

dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))

    ID ValueSmall
 1:  A        478
 2:  A        862
 3:  B        439
 4:  B        245
 5:  C         71
 6:  C        100
 7:  D        317
 8:  D        519
 9:  E        663
10:  E        407



Merge

Next, I want to merge by ID, keeping only matches where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the output is different at all. It seems like it is just returning the last matched value.

## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]

## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]



Results

    ID ValueSmall RankSmall RankSmall2 DesiredRank
 1:  A        478         1          4           4
 2:  A        862         1          7           7
 3:  B        439         3          4           4
 4:  B        245         1          2           2
 5:  C         71         1          1           1
 6:  C        100         1          1           1
 7:  D        317         1          2           2
 8:  D        519         3          5           5
 9:  E        663         2          5           5
10:  E        407         1          1           1

Is there a better data.table way of grabbing the max value in another data.table with multiple matches?

Recommended answer


Next, I want to merge by ID, keeping only matches where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.



setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]

    ID ValueSmall r
 1:  A        478 4
 2:  A        862 7
 3:  B        439 4
 4:  B        245 2
 5:  C         71 1
 6:  C        100 1
 7:  D        317 2
 8:  D        519 5
 9:  E        663 5
10:  E        407 1

I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you don't want dtBig to stay sorted, save the previous row order so you can restore it afterwards.
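One way to save and restore the row order is sketched below. It uses tiny made-up tables (big and small, not the question's data), and the helper column name orig_order is an assumption chosen for illustration.

```r
library(data.table)

# Tiny hypothetical tables, deliberately unsorted (not the question's data)
big <- data.table(ID = c("A", "A", "A", "B", "B"),
                  ValueBig = c(30, 10, 20, 7, 5))
big[, Rank := frank(ValueBig, ties.method = "first"), by = ID]
small <- data.table(ID = c("A", "B"), ValueSmall = c(25, 6))

# Remember the current row order in a helper column (name is an assumption)
big[, orig_order := .I]

# Sort so the last matching row for each lookup has the largest ValueBig
setorder(big, ID, ValueBig, Rank)
small[, r := big[.SD, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]]

# Restore the original row order and drop the helper column
setorder(big, orig_order)
big[, orig_order := NULL]

print(small)  # r should be 2 for ID A and 1 for ID B
```

Because setorder reorders by reference, the helper column travels with the rows, so sorting on it again puts big back the way it was.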


Is there a way to aggregate these matches using a function like max or min for these multiple matches?

For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so...

dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
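To see why the direction of the join matters, here is a minimal sketch on made-up tables (big, small, and the columns via_i and via_x are illustrative names, not from the question). As far as I understand by=.EACHI, it forms one group per row of the table passed in i, so with dtSmall[dtBig, ...] each group sees a single i.Rank and later groups overwrite earlier assignments, which would explain why Method 1 in the question depends on the order of dtBig.

```r
library(data.table)

# Tiny hypothetical tables (not the question's data); big is deliberately unsorted
big <- data.table(ID = "A", ValueBig = c(20, 10), Rank = c(2L, 1L))
small <- data.table(ID = "A", ValueSmall = 25)

# small[big, ..., by = .EACHI]: one group per row of big, so max() sees a
# single i.Rank each time and the last matching row of big wins
small[big, via_i := max(i.Rank), by = .EACHI, on = .(ID, ValueSmall >= ValueBig)]

# big[small, ..., by = .EACHI]: one group per row of small, so max() sees
# every matching row of big at once
small[, via_x := big[.SD, on = .(ID, ValueBig <= ValueSmall), max(x.Rank), by = .EACHI]$V1]

print(small)  # via_i should be 1 (last match), via_x should be 2 (true max)
```

Swapping max for min in the second join should give the smallest matching rank in the same way.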
