Using dplyr for frequency counts of interactions, must include zero counts
Question
My question involves writing code using the dplyr package in R.
I have a relatively large data frame (approx. 5 million rows) with 2 columns: the first with an individual identifier (id), and the second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals and about 2,600 unique dates. For example, the beginning of the data looks like this:
id date
John12 2006-08-03
Tom2993 2008-10-11
Lisa825 2009-07-03
Tom2993 2008-06-12
Andrew13 2007-09-11
I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column that counts the total number of events (possibly 0) that occurred for the listed individual on the given date.
I've had some success with the dplyr package, which I've used to tabulate the id x date counts that are observed in the data.
Here's the code I've used to tabulate id x date counts so far (my data frame is called df):
library(dplyr)
# %.% was the chaining operator in early dplyr releases; current versions use %>%
reduced <- df %>%
  group_by(id, date) %>%
  summarize(n = n())
My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.
I considered creating the frame above and then merging it with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions much appreciated!
Recommended answer
Here's a simple option, using data.table instead:
library(data.table)
dt = as.data.table(your_df)
setkey(dt, id, date)
# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
# id date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6: John12 2006-08-03 1
# 7: John12 2007-09-11 0
# 8: John12 2008-06-12 0
# 9: John12 2008-10-11 0
#10: John12 2009-07-03 0
#11: Lisa825 2006-08-03 0
#12: Lisa825 2007-09-11 0
#13: Lisa825 2008-06-12 0
#14: Lisa825 2008-10-11 0
#15: Lisa825 2009-07-03 1
#16: Tom2993 2006-08-03 0
#17: Tom2993 2007-09-11 0
#18: Tom2993 2008-06-12 1
#19: Tom2993 2008-10-11 1
#20: Tom2993 2009-07-03 0
In versions 1.9.2 or before, the equivalent expression omits the explicit by:
dt[CJ(unique(id), unique(date)), .N]
The idea is to create all possible pairs of id and date (which is what the CJ part does), and then join the original data onto those pairs, counting the matching rows for each pair.
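To make that idea more concrete, here is a rough sketch that spells out the same steps one at a time instead of doing them in a single join. It is not the answer's code: the toy data are the five rows shown in the question, and the object names (observed, all_pairs, counts) are made up for illustration. It assumes a reasonably recent data.table.

```r
library(data.table)

# Toy data copied from the first five rows shown in the question
dt <- data.table(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = as.Date(c("2006-08-03", "2008-10-11", "2009-07-03",
                   "2008-06-12", "2007-09-11"))
)

# Step 1: count the id x date pairs that actually occur in the data
observed <- dt[, .N, by = .(id, date)]

# Step 2: build every possible id x date pair (CJ = cross join)
all_pairs <- CJ(id = unique(dt$id), date = unique(dt$date))

# Step 3: left-join the observed counts onto the full grid of pairs
counts <- merge(all_pairs, observed, by = c("id", "date"), all.x = TRUE)

# Step 4: pairs that were never observed come back as NA; set them to 0
counts[is.na(N), N := 0L]

print(counts)
```

With the sizes mentioned in the question, the full grid would have roughly 300,000 x 2,600 = 780 million rows, so it is probably worth checking memory before trying either version on the real data.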