Using dplyr for frequency counts of interactions, must include zero counts

Question

My question involves writing code using the dplyr package in R

I have a relatively large dataframe (approx 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals, and about 2600 unique dates. For example, the beginning of the data look like this:

    id         date
    John12     2006-08-03
    Tom2993    2008-10-11
    Lisa825    2009-07-03
    Tom2993    2008-06-12
    Andrew13   2007-09-11

I'd like to reshape the data so that I have a row for every possible id x date pair, with an additional column which counts the total number of events that occurred (perhaps taking the value 0) for the listed individual on the given date.

I've had some success with the dplyr package, which I've used to tabulate the id x date counts which are observed in the data.

Here's the code I've used to tabulate id x date counts so far: (my dataframe is called df)

library(dplyr)

reduced = df %>%
  group_by(id, date) %>%
  summarize(n = n())

My problem is that (as I said above) I'd like to have a dataset that also includes 0s for id x date pairs that don't have any associated actions. For example, if there's no observed action for John12 on 2007-10-10, I'd like the output to return a row for that id x date pair, with a count of 0.

I considered creating the frame above, then merging it with an empty frame, but I'm convinced there must be a simpler solution. Any suggestions much appreciated!

Recommended answer

Here's a simple option, using data.table instead:

library(data.table)

dt = as.data.table(your_df)

setkey(dt, id, date)

# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#          id       date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6:   John12 2006-08-03 1
# 7:   John12 2007-09-11 0
# 8:   John12 2008-06-12 0
# 9:   John12 2008-10-11 0
#10:   John12 2009-07-03 0
#11:  Lisa825 2006-08-03 0
#12:  Lisa825 2007-09-11 0
#13:  Lisa825 2008-06-12 0
#14:  Lisa825 2008-10-11 0
#15:  Lisa825 2009-07-03 1
#16:  Tom2993 2006-08-03 0
#17:  Tom2993 2007-09-11 0
#18:  Tom2993 2008-06-12 1
#19:  Tom2993 2008-10-11 1
#20:  Tom2993 2009-07-03 0

In versions 1.9.2 or before the equivalent expression omits the explicit by:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does), and then merge it back, counting occurrences.
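
For readers who would rather stay in the dplyr family, the same idea (build every id x date pair, then fill the unobserved pairs with 0) can be sketched with tidyr's complete(). This is not part of the original answer; it assumes the tidyr package is installed, and the small df below is made up from the five example rows in the question.

```r
library(dplyr)
library(tidyr)

# Toy data: the five example rows from the question (illustrative only)
df <- tibble(
  id   = c("John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"),
  date = as.Date(c("2006-08-03", "2008-10-11", "2009-07-03",
                   "2008-06-12", "2007-09-11"))
)

counts <- df %>%
  count(id, date) %>%                       # observed id x date counts, in column n
  complete(id, date, fill = list(n = 0L))   # add the missing pairs with n = 0

counts
```

On the toy data this gives one row for each of the 4 ids x 5 dates. At the scale described in the question (about 300,000 ids and 2,600 dates) the full grid would be roughly 780 million rows under either approach, so it may be worth checking available memory before materializing it.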
