非等式自联接上data.table更新的奇怪行为 [英] Odd behaviour of data.table's update on non-equi self-join

查看:107
本文介绍了非等式自联接上data.table更新的奇怪行为的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在准备答案时,对问题

While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R I noticed that I do get different results depending on whether the table is updated in place or returned as a new object. Also, I do get different result when I change the order of columns in the non-equi join conditions.

目前,我对此没有任何解释,可能是由于我的主要误解或简单的编码错误造成的.

Currently, I don't have an explanation for this, perhaps due to a major misunderstanding on my side or a simple coding error.

请注意,这个问题特别要求 data.table连接的观察到的行为的解释.如果你 对于潜在的问题有替代解决方案,请感到 免费发布原始问题的答案.

Please, note that this question is asking particularly for explanations of the observed behaviour of data.table joins. If you have alternative solutions to the underlying problem, please, feel free to post an answer to the original question.

原始问题和可行答案

最初的问题是如何使用以下数据计算每个患者在住院前365天(包括实际住院)中发生的住院次数:

Original question and working answer

The original question was how to count the number of hospitalizations occuring in the 365 days before that hospitalization (including the actual one) for each patient using these data:

library(data.table)   # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7", 
                                   "2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0

   patient.id hospitalization.date
1:          1           2013-10-15
2:          1           2015-07-16
3:          1           2016-01-07
4:          2           2014-10-15
5:          2           2015-12-20
6:          2           2015-12-25
7:          2           2016-02-10

下面的代码给出了预期的答案(为清楚起见,在此处添加了其他帮助器列)

The code below gives the expected answer (additional helper column added here for clarity)

# add helper columns
DT0[, start.date := hospitalization.date - 365][
  , end.date := hospitalization.date][]
DT0

   patient.id hospitalization.date start.date   end.date
1:          1           2013-10-15 2012-10-15 2013-10-15
2:          1           2015-07-16 2014-07-16 2015-07-16
3:          1           2016-01-07 2015-01-07 2016-01-07
4:          2           2014-10-15 2013-10-15 2014-10-15
5:          2           2015-12-20 2014-12-20 2015-12-20
6:          2           2015-12-25 2014-12-25 2015-12-25
7:          2           2016-02-10 2015-02-10 2016-02-10

result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date", 
              "hospitalization.date<=end.date"), 
   .(hospitalizations.last.year = .N), by = .EACHI][]
result

   patient.id hospitalization.date hospitalization.date hospitalizations.last.year
1:          1           2012-10-15           2013-10-15                          1
2:          1           2014-07-16           2015-07-16                          1
3:          1           2015-01-07           2016-01-07                          2
4:          2           2013-10-15           2014-10-15                          1
5:          2           2014-12-20           2015-12-20                          1
6:          2           2014-12-25           2015-12-25                          2
7:          2           2015-02-10           2016-02-10                          3

重命名和重复的列名(保留用于比较)除外.

except for the renamed and duplicated column names (which are left as is for comparison).

对于patient.id == 2,最后一行的结果为3,因为该患者自2015-02-10以来第三次于2016-02-10住院.

For patient.id == 2, the result in last row is 3 because the patient was hospitalized on 2016-02-10 for the third time since 2015-02-10.

result是一个新的data.table对象,它占用更多内存.我尝试使用以下方法在适当的位置更新原始的data.table对象:

result is a new data.table object which occupies additional memory. I tried to update the original data.table object in place using:

# use copy of DT0 which can be safely modified
DT <- copy(DT0)

DT[DT, on = c("patient.id", "hospitalization.date>=start.date", 
            "hospitalization.date<=end.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          2
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          3
7:          2           2016-02-10 2015-02-10 2016-02-10                          3

DT现在已就位更新 ,但第5和第6行现在显示3例住院,而不是1或2例.似乎现在每个行都返回了上一期间的住院总数.

DThas now been updated in place but rows 5 and 6 show now 3 hospitalizations instead of 1 or 2, resp. It seems that now the total number of hospitalizations within the last period is returned for each of the rows.

即使在自联接中,非等参联接条件下的列顺序也很重要:

Also the order of columns in the non-equi join conditions do matter, even in a self-join:

result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date", 
                          "end.date>=hospitalization.date"), 
              .(hospitalizations.last.year = .N), by = .EACHI][]
result

我的期望是"start.date<=hospitalization.date"等同于"hospitalization.date>=start.date"(请注意,同时切换了<>),但结果是

My expectation was that "start.date<=hospitalization.date" would be equivalent to "hospitalization.date>=start.date" (note that also < and > were switched) but the result

   patient.id start.date   end.date hospitalizations.last.year
1:          1 2013-10-15 2013-10-15                          1
2:          1 2015-07-16 2015-07-16                          2
3:          1 2016-01-07 2016-01-07                          1
4:          2 2014-10-15 2014-10-15                          1
5:          2 2015-12-20 2015-12-20                          3
6:          2 2015-12-25 2015-12-25                          2
7:          2 2016-02-10 2016-02-10                          1

是不同的.似乎现在正在计算即将住院的人数

is different. It seems that now the number of forthcoming hospitalizations are being counted

有趣的是,就地更新 确实返回相同的结果(某些列名除外):

Interestingly, the update in place does now return the same result (except for some of the column names):

# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date", 
              "end.date>=hospitalization.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          1
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          2
7:          2           2016-02-10 2015-02-10 2016-02-10                          1

相关

有一个潜在的相关问题导致了有一个 Arun的答案关于将x.前缀与 non-equi joins一起使用.

推荐答案

分组by=.EACHI表示按每个i"而不是按每个x".

The grouping by=.EACHI means "by each i" not "by each x".

# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]

      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 * 
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1 
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *  
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2  
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2 
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *

对于患者1,分组为

  • .(start.date = 2012-10-15, end.date = 2013-10-15),计数为1
  • .(start.date = 2014-07-16, end.date = 2015-07-16),计数为1
  • .(start.date = 2015-01-07, end.date = 2016-01-07),计数为2
  • .(start.date = 2012-10-15, end.date = 2013-10-15), count of 1
  • .(start.date = 2014-07-16, end.date = 2015-07-16), count of 1
  • .(start.date = 2015-01-07, end.date = 2016-01-07), count of 2

幸运的是,此联接中既有七个组,又在原始表中有七个行.

It is just by luck that there are both seven groups in this join and seven rows in the original table.

对于更棘手的问题,我将在笔记中举一个例子:

For the tougher issue, I'll borrow an example from my notes:

当心一个更新联接中的多个匹配项.当存在多个匹配项时,一个更新联接显然只使用最后一个匹配项.不幸的是,这是默默地完成的.试试:

Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:

a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), 
  t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]

# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x 
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13

在详细说明中,我们看到有关分配到2行的3行子集"的有用信息.

With verbose on, we see a helpful message about assignment "to 3 row subset of 2 rows."

-修改自快速R教程"部分更新联接"

不幸的是,在OP的情况下,verbose=TRUE不提供 这样的消息.

In the OP's case, verbose=TRUE does not offer such a message, unfortunately.

DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ... 
#   forder took ... 0.01 secs
#   Generating group lengths ... done in 0 secs
#   Generating non-equi group ids ... done in 0 secs
#   Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none> 
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 7 groups
#   eval(j) took 0.000s for 7 calls
# 0.01 secs

但是,我们可以看到每个x组的最后一行确实包含OP看到的值.我已经在上面用星号手动标记了这些.或者,您可以用z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date]标记它们.

However, we can see that the last row per x group does contain the value the OP sees. I've manually marked these with asterisks above. Alternately, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].

作为参考,此处的更新联接为...

For reference, the update join here is...

DT[, n := 
  .SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N 
]

   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3

这是处理这种情况的正确/惯用方式,即在另一个表中查找x的每一行并计算结果摘要,从而将列添加到x中:

This is the correct/idiomatic way to handle this case, of adding columns to x based on looking up each row of x in another table and computing a summary of the result:

x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]

这篇关于非等式自联接上data.table更新的奇怪行为的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆