Vectorize cumsum by factor in R

Question

I am trying to create a column in a very large data frame (~2.2 million rows) that holds the cumulative sum of 1's within each factor level, resetting when a new factor level is reached (and, as the expected output further down shows, whenever a 0 occurs). Below is some basic data that resembles my own.

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

I would like the output variable, cum.goodp, to look like this:

cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)

I know there is a lot of material out there on the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:

k <- transform(df, cum.goodp = goodp*ave(goodp, c(0L, cumsum(diff(goodp != 0)), FUN = seq_along, by = itemcode)))

When I run this code it is extremely slow. I understand that transform is part of the reason (the 'by' doesn't help either). The itemcode variable has over 70K distinct values, so the computation should probably be vectorized. Is there a way to vectorize this using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.

Recommended answer

With the modified example input/output you could use the following base R approach (among others):

transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
#   itemcode goodp cum.goodp cum.goodpX
#1        a1     0         0          0
#2        a1     1         1          1
#3        a1     1         2          2
#4        a1     0         0          0
#5        a1     1         1          1
#6        a2     1         1          1
#7        a2     1         2          2
#8        a3     0         0          0
#9        a4     0         0          0
#10       a4     1         1          1
#11       a5     1         1          1
#12       a6     1         1          1
#13       a6     1         2          2
#14       a6     0         0          0
#15       a6     1         1          1

Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.
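
To see why the second grouping variable works, here is a minimal sketch that builds the grouping key explicitly. The names zero_run and res are made up for illustration; the data is the example from the question.

```r
# Rebuild the example data from the question.
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)

# zero_run (hypothetical name) goes up by one at every 0,
# so each 0 opens a new run of rows.
zero_run <- cumsum(df$goodp == 0)

# Grouping by itemcode AND zero_run restarts the running sum
# whenever the item changes or a 0 is encountered.
res <- ave(df$goodp, df$itemcode, zero_run, FUN = cumsum)

print(data.frame(df, zero_run, res))
```

Because a 0 contributes nothing to the sum, starting a new run exactly at each 0 gives the same values as resetting the counter to 0 there.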

But of course you can use many other approaches with packages, either what @MartinMorgan suggested or for example using dplyr or data.table, to name just two options. Those may be a lot faster than base R approaches for large data sets.
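
If you would rather stay in base R and avoid grouping altogether, one possible sketch uses only cumsum and cummax. This is not from the original answer: cumsum_reset is a hypothetical helper, and it assumes goodp contains only 0s and 1s with no NAs and that all rows of one itemcode are adjacent, as in the example data.

```r
# Hypothetical helper: running sum of x that restarts at every 0
# and at every change of group. Assumes x is 0/1 without NAs and
# that rows belonging to the same group are contiguous.
cumsum_reset <- function(x, group) {
  n <- length(x)
  if (n == 0L) return(x)
  group <- as.character(group)
  # TRUE where the count must restart: first row, new group, or a 0.
  reset <- x == 0 | c(TRUE, group[-1L] != group[-n])
  total <- cumsum(x)
  # Running total just before each restart row; cummax() carries the
  # most recent one forward because these totals never decrease.
  offset <- cummax(ifelse(reset, total - x, 0))
  total - offset
}

itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
              'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
out <- cumsum_reset(goodp, itemcode)
print(out)
```

The function makes a fixed number of passes over the vector regardless of how many distinct item codes there are, which is the property the question is after. If the rows are not already ordered by itemcode, sort them first.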

Here's how it would be done in dplyr:

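library(dplyr)
df %>% 
   group_by(itemcode, grp = cumsum(goodp == 0)) %>%  # new group at each item and at each 0
   mutate(cum.goodpX = cumsum(goodp)) %>%            # running sum within each group
   ungroup() %>%                                     # return an ungrouped result
   select(-grp)                                      # drop the helper column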

A data.table option was already provided in the comments to your question.
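
Since those comments are not reproduced here, the following is only a sketch of what a data.table version might look like, using the same grouping idea as above. It is not necessarily the code from the comments, and the names dt and zero_run are made up for illustration.

```r
library(data.table)

# Example data from the question, as a data.table.
dt <- data.table(
  itemcode = c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
               'a5', 'a6', 'a6', 'a6', 'a6'),
  goodp = c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
)

# Add cum.goodp by reference: running sum within each combination of
# itemcode and zero_run, where zero_run increases by one at every 0.
dt[, cum.goodp := cumsum(goodp), by = .(itemcode, zero_run = cumsum(goodp == 0))]

print(dt)
```

The := operator updates dt in place rather than copying it, which can matter with a couple of million rows.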
