Using dplyr for frequency counts of interactions, must include zero counts

Problem description

My question involves writing code using the dplyr package in R.

I have a relatively large dataframe (approx 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:

    id         date
    John12     2006-08-03
    Tom2993    2008-10-11
    Lisa825    2009-07-03
    Tom2993    2008-06-12
    Andrew13   2007-09-11

I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.

I've had some success with the dplyr package, which I've used to tabulate the id x date counts which are observed in the data.

Here's the code I've used to tabulate id x date counts so far (my dataframe is called df):

library(dplyr)
# %>% replaces the older %.% chaining operator
reduced = df %>%
  group_by(id, date) %>%
  summarize(length(date))

My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.

I considered creating the frame above, then merging it with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions much appreciated!

Solution

Here's a simple option, using data.table instead:

library(data.table)

dt = as.data.table(your_df)

setkey(dt, id, date)

# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#          id       date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6:   John12 2006-08-03 1
# 7:   John12 2007-09-11 0
# 8:   John12 2008-06-12 0
# 9:   John12 2008-10-11 0
#10:   John12 2009-07-03 0
#11:  Lisa825 2006-08-03 0
#12:  Lisa825 2007-09-11 0
#13:  Lisa825 2008-06-12 0
#14:  Lisa825 2008-10-11 0
#15:  Lisa825 2009-07-03 1
#16:  Tom2993 2006-08-03 0
#17:  Tom2993 2007-09-11 0
#18:  Tom2993 2008-06-12 1
#19:  Tom2993 2008-10-11 1
#20:  Tom2993 2009-07-03 0

In versions 1.9.2 and earlier, the equivalent expression omits the explicit by:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does), and then merge it back, counting occurrences.
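Since the question asked about dplyr, the same idea (build every id x date pair, then fill in the counts) can probably be sketched with dplyr plus tidyr as well. This is only a minimal sketch and not part of the original answer: it assumes the tidyr package is installed, and it rebuilds the five sample rows from the question as `df` so that it runs on its own. `count()` tabulates the observed pairs and `complete()` adds the unobserved ones with a count of 0.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question (assumed layout: one row per action)
df <- data.frame(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = as.Date(c("2006-08-03", "2008-10-11", "2009-07-03",
                   "2008-06-12", "2007-09-11"))
)

counts <- df %>%
  count(id, date) %>%                          # observed id x date counts, column n
  complete(id, date, fill = list(n = 0L)) %>%  # add unobserved pairs with n = 0
  arrange(id, date)

nrow(counts)   # 4 ids x 5 dates = 20 rows
sum(counts$n)  # 5, the number of rows in df
```

Whichever approach is used, note the size of the result: with about 300,000 ids and about 2,600 dates, the full grid has roughly 780 million rows, so memory is likely to be the limiting factor.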
