Is there a way to efficiently count column values in A falling within ranges in B using data.table?

Question

I have created some code to handle the following task:

ref = read.table(header=TRUE, text="
user    event
1441    120120102
1441    120120888
1443    120122122
1445    120124452
1445    120123525
1446    120123463", stringsAsFactors=FALSE)

data = read.table(header=TRUE, text="
user    event1        event2
1440    120123432     120156756
1441    120128523     120156545
1441    120123333     120146444
1441    120122344     120122355", stringsAsFactors=FALSE)

What I have here is a function (credit to user Carlos Cinelli) that goes line by line through the table data and counts how many events in ref fall between event1 and event2 for the same user id.

Now, I am wondering if there is a faster way to do the function below:

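library(data.table); setDT(ref); setDT(data)  # the data.table syntax below needs data.tables
count <- function(x, y, z) ref[, sum(event >= x & event <= y & user == z)]
data[, count := mapply(x = event1, y = event2, z = user, count)]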

I haven't been able to do much and was wondering if the data.table package would have anything that can help with making the above faster. Thank you so much!

Recommended answer

Have a look at the examples from ?foverlaps. They clearly show how you can join based on overlapping intervals within other identifiers.

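library(data.table)  ## requires 1.9.3+
setDT(ref)
setDT(data)

# Turn each event into a zero-width interval [event, event2], then key ref on all of its columns
setkey(ref[, event2 := event])
ans = foverlaps(data, ref, by.x=c("user", "event1", "event2"), which=TRUE, nomatch=0L)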


Your example is particularly bad because there are no overlaps. So I can't really demonstrate the next few steps. But ans should provide you with overlapping row indices of ref (yid) for each row in data (xid). And the overlaps are obtained within user - since it was set as a key column as well.
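
For illustration only, here is a minimal sketch of how the remaining steps might look on made-up data that does contain overlaps. The tables ref2 and data2 and their values are hypothetical and not from the question; the counting step with tabulate() is one possible approach, not necessarily the one the answer had in mind.

```r
library(data.table)

# Hypothetical data (not from the question) in which some events fall inside ranges
ref2  <- data.table(user  = c(1441L, 1441L, 1441L, 1443L),
                    event = c(100L, 150L, 300L, 120L))
data2 <- data.table(user   = c(1440L, 1441L, 1441L, 1443L),
                    event1 = c(50L, 90L, 200L, 100L),
                    event2 = c(400L, 160L, 250L, 130L))

# Represent each event as a zero-width interval and key the lookup table
ref2[, event2 := event]
setkey(ref2, user, event, event2)

# One row per (row of data2, matching row of ref2) pair; non-matching rows are dropped
idx <- foverlaps(data2, ref2, by.x = c("user", "event1", "event2"),
                 which = TRUE, nomatch = 0L)

# Count matches per row of data2; rows that never appear in idx get 0
data2[, count := tabulate(idx$xid, nbins = nrow(data2))]
print(data2)
```

With the default type="any", a zero-width interval overlaps a range exactly when event1 <= event <= event2, which matches the inclusive comparison used in the question's count function.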

I hope you can take it from here... If you find this doesn't resolve, please post an example that I can run to reproduce the same issue you're running into.

HTH
