Count rows matching a criterion relative to the current row


Question

I have a data frame structured like this (but it actually has ~400k rows):

library(data.table)
df <- fread("    id     start     end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
 27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
 74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
 27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")

(Edited, thanks to Uwe)

For each row, I want to count how many rows in the data frame have the same id as the current row and a start-end period that overlaps the period in the current row. The count includes the row itself. For example, for the first row, the result would be 2, since there is another row with id = 174095 and its end is later than the first row's start.
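To make the target concrete, here is a brute-force sketch of that count on the sample above. It is quadratic, so it only serves as a reference definition on small data, not as a solution for 400k rows. The helper name `count_overlaps` is hypothetical, and the dates are converted with `as.IDate` so that the comparisons are made on dates.

```r
library(data.table)

df <- fread("id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")
cols <- c("start", "end")
df[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]

# Hypothetical helper: for each row, count the rows with the same id whose
# [start, end] period overlaps that row's period (the row itself included).
count_overlaps <- function(dt) {
  vapply(seq_len(nrow(dt)), function(k) {
    sum(dt$id == dt$id[k] & dt$start <= dt$end[k] & dt$end >= dt$start[k])
  }, integer(1))
}

ref_counts <- count_overlaps(df)
ref_counts
# 2 1 1 2 1 1 1 2 2 1 1
```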

I tried to do it with dplyr's rowwise, like:

df = df %>% rowwise() %>% mutate(count = sum(id == df$id & ((start >= df$start & start <= df$end) | (end >= df$start & end <= df$end))))

But this is extremely slow. I gave it a try and two hours later it was still running.

I also tried to use mapply, but it takes way too much time as well:

df$count = mapply(function(id, start, end) {
  # count rows with the same id whose start or end falls inside this row's period
  sum(df$id == id & (between(df$start, start, end) | between(df$end, start, end)))
}, df$id, df$start, df$end)

Is there a reasonably efficient way to do this?

Thanks a lot.

EDIT 2019-03-06

@Uwe's suggested solution:

df[, overlapping.rows := df[.SD, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]

works just fine for the sample data.frame above. But it turns out the sample wasn't illustrative enough, or maybe I didn't really make myself understood :)

I added a third record for id 174095 and modified the other two:

df <- fread("id     start     end
174095 2018-12-19 2018-12-31
            227156 2018-12-19 2018-12-31
            210610 2018-04-13 2018-09-27
            27677 2018-04-12 2018-04-26
            370474 2017-07-13 2017-08-19
            303693 2017-02-20 2017-04-09
            74744 2016-10-03 2016-11-05
            174095 2018-12-01 2018-12-18
            27677 2018-03-01 2018-05-29
            111111 2018-01-01 2018-01-31
            111111 2018-11-11 2018-12-31
            174095 2018-11-30 2018-12-25")

Now, id 174095 has two intervals that do not overlap each other (rows 1 and 2) and another interval that overlaps both of them (row 3):

           id      start        end
1: 174095 2018-12-19 2018-12-31
2: 174095 2018-12-01 2018-12-18
3: 174095 2018-11-30 2018-12-25

So the result should be:

       id      start        end overlapping.rows
1: 174095 2018-12-19 2018-12-31                2
2: 174095 2018-12-01 2018-12-18                2
3: 174095 2018-11-30 2018-12-25                3

But it actually is:

       id      start        end overlapping.rows
1: 174095 2018-12-19 2018-12-31                3
2: 174095 2018-12-01 2018-12-18                3
3: 174095 2018-11-30 2018-12-25                3

If I'm not mistaken, this is happening because the final join is done by "id" only, so all the rows with the same id get the same result.
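A minimal sketch of that effect on the expanded sample, assuming the counts are first computed with the non-equi self-join and then appended in two different ways. The names `by_id` and `by_pos` are hypothetical, and `allow.cartesian = TRUE` is added only as a precaution for the many-to-many match on id.

```r
library(data.table)

df <- fread("id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")
cols <- c("start", "end")
df[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]

# One count per row of df, in the original row order
tmp <- df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]

# Joining back on id alone: each row of df matches every row of tmp with
# the same id, so the three 174095 rows cannot receive different counts.
by_id <- copy(df)[tmp, on = "id", overlapping.rows := N, allow.cartesian = TRUE]
by_id[id == 174095, overlapping.rows]

# Appending by position keeps the per-row counts
by_pos <- copy(df)[, overlapping.rows := tmp$N]
by_pos[id == 174095, overlapping.rows]
# 2 2 3
```

Any way of appending that identifies the individual row (position, a row number, or the full id/start/end key) avoids the collapse.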

My solution consists of performing the final merge also by "start" and "end":

df[tmp, on = .(id, start, end), overlapping.rows := N]

For some reason (I would love to find out...), in the self-join, start dates end up in the "end" column and vice versa, so I had to add this line right after it:

setnames(tmp, c("id", "end", "start", "N"))
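A likely explanation for the swapped columns, sketched on a hypothetical one-row example: in a data.table non-equi join, the join columns of the result are named after x's columns but are filled with the values from i they were compared against. With `start <= end`, the result column `start` therefore holds i's `end`, and with `end >= start`, the result column `end` holds i's `start`.

```r
library(data.table)

# Toy tables (hypothetical values) with clearly distinguishable numbers
x <- data.table(id = 1L, start = 10L, end = 20L)
i <- data.table(id = 1L, start = 15L, end = 30L)

res <- x[i, on = .(id, start <= end, end >= start), .N, by = .EACHI]
res
# res$start is 30 (i's end) and res$end is 15 (i's start)
```

This is why renaming the columns of `tmp`, or carrying a row number through the join, is needed before joining back on start and end.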

Now the result is:

            id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Which is exactly what I expected!

Recommended answer

Edited 2019-03-07 to handle the OP's expanded dataset

This can be solved by aggregating in a non-equi self-join:

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# non-equi self-join and aggregate
tmp <- df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]
# append counts to original dataset
df[, overlapping.rows := tmp$N]
df



        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3


Using data.table chaining, the code can be written in a more compact but also more convoluted way:

library(data.table)
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][
  , overlapping.rows := df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]

Note that the part to append the results to the original df is based on Frank's comment.

My original attempt to use a second join to append the results to the original df failed when there are different counts for the same id, as pointed out by the OP. This can be fixed by including the row number in the second join:

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# append row number
tmp <- df[, rn := .I][
  # non-equi self-join and aggregate
  df, on = .(id, start <= end, end >= start), .(rn = i.rn, .N), by = .EACHI]
# append counts to original dataset by joining on row number
df[tmp, on = "rn", overlapping.rows := N][, rn := NULL]
df



        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3



Explanation

The join condition in the non-equi join does the trick. Two intervals do not overlap if the first one ends before the second one starts or the first one starts after the second one has ended:

e1 < s2 OR e2 < s1

Now, if two intervals do intersect/overlap, then the opposite of the above must be true. By negating and applying De Morgan's law we get the conditions

s2 <= e1 AND e2 >= s1

which are used in the non-equi join.
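The two conditions can be checked on plain dates with a small sketch. The helper names `overlaps` and `disjoint` are hypothetical, and the intervals are treated as closed, so periods that merely touch at an endpoint count as overlapping.

```r
# Hypothetical helper: TRUE when [s1, e1] and [s2, e2] have no day in common
disjoint <- function(s1, e1, s2, e2) e1 < s2 | e2 < s1

# Hypothetical helper: the negation of disjoint(), by De Morgan's law
overlaps <- function(s1, e1, s2, e2) s2 <= e1 & e2 >= s1

d <- as.Date

# Rows 1 and 2 of id 174095: the second period ends the day before the first starts
r12 <- overlaps(d("2018-12-19"), d("2018-12-31"), d("2018-12-01"), d("2018-12-18"))
r12  # FALSE

# Rows 1 and 3 of id 174095: the periods share 2018-12-19 to 2018-12-25
r13 <- overlaps(d("2018-12-19"), d("2018-12-31"), d("2018-11-30"), d("2018-12-25"))
r13  # TRUE
```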

OP's expanded dataset as described in OP's EDIT 2019-03-06:

library(data.table)
df <- fread("id     start     end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677  2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744  2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677  2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")
