Create new column based on condition that exists within a rolling date


Question

To make this question more general, I believe it could also be rephrased as: creating a rolling, time-sensitive factor variable. Though an uncommon requirement, this could be useful for many different data sources.

I have a series of non-uniform time data with more than one record per day for thousands of users. I want to create a new column, player_type, that keeps track of a rolling 30-day definition of their behavior. The behavior is defined by which games they play; the column 'games' is a factor with levels gameA and gameB.

There are thus three types of behavior:

  1. Exclusively plays GameA - 'A'
  2. Exclusively plays GameB - 'B'
  3. Plays both games - 'Hybrid'

I want to use this new column to see how their play behavior changes over time, and to count the number of players in each group over time to see how the groups change.

The time series is highly irregular for each player: a record is created only when the player plays a game. Players can play multiple types of games per day, or not play any games for many months. I therefore expect a solution might use a filter something like:

interval(current_date, current_date - new_period(days = 30)) (using lubridate).

Here is an example data set. Keep in mind that it is simplified and tests a rolling 1-day change, so simple methods that only check the previous record will not actually work. If you are able to make a better data set, please advise and I will edit this post.

p <- c( 1,   1,   1,   2,   2,   2,   6,   6,   6)
g <- c('A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B')
d <- seq(as.Date('2014-10-01'), as.Date('2014-10-09'), by=1)
df <- data.frame(player_id = p, date = d, games = g)

I need:

 player_id       date games   type
1         1 2014-10-01     A      A (OR NA)
2         1 2014-10-02     B Hybrid
3         1 2014-10-03     B      B
4         2 2014-10-04     A      A (OR NA)
5         2 2014-10-05     B Hybrid
6         2 2014-10-06     A Hybrid
7         6 2014-10-07     A      A (OR NA)
8         6 2014-10-08     B Hybrid
9         6 2014-10-09     B      B

The solution should be something like an apply() over the data, using a function that looks back 30 days in time and an ifelse() statement to see which games they played.
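The look-back idea described above can be sketched in base R. This is only a minimal illustration, not the approach from the answer below: rolling_type() is a hypothetical helper name, the window is assumed to include the current day, and the row-by-row scan is quadratic in the number of records, so it is likely too slow for thousands of users.

```r
# Example data from the question (games kept as character for simplicity)
p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)
df <- data.frame(player_id = p, date = d, games = g, stringsAsFactors = FALSE)

# rolling_type() is a hypothetical helper, not part of any package.
# For each row, collect the games the same player played in the window
# [date - roll_days, date] and label the row accordingly.
rolling_type <- function(df, roll_days) {
  vapply(seq_len(nrow(df)), function(i) {
    in_window <- df$player_id == df$player_id[i] &
      df$date >= df$date[i] - roll_days &
      df$date <= df$date[i]
    played <- unique(df$games[in_window])
    # More than one distinct game in the window means a hybrid player
    if (length(played) > 1) "Hybrid" else played
  }, character(1))
}

df$type <- rolling_type(df, roll_days = 1)
```

With roll_days = 1 this reproduces the non-NA version of the desired type column; passing roll_days = 30 would give the 30-day definition.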

This is a very similar post and should help solve this problem: How do I do a conditional sum which only looks between certain date criteria

I have also explored rowwise() and conditional mutate() calls using dplyr; however, the catch for me is the historical time component.

Thanks for all the help! I can't thank this forum enough. I'll be checking back frequently.

Recommended answer

Assuming that I understood it right, here's a data.table way using the foverlaps() function.

Create dt and set its key as shown below:

library(data.table)

# p, g and d are the vectors defined in the question
dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

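hybrid_index <- function(dt, roll_days) {
    # Each row's look-back window is [date - roll_days, end_date]
    ivals = copy(dt)[, date := date-roll_days]
    # by.x is spelled out because modifying 'date' changes the key on the copy
    olaps = foverlaps(ivals, dt, by.x=key(dt), type="any", which=TRUE)
    # TRUE where a record in the window is a different game from the row's own
    olaps[, val := dt$games[xid] != dt$games[yid]]
    # Row indices of dt with at least one differing game in their window
    olaps[, any(val), by=xid][(V1), xid]
}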

We create a dummy data.table ivals (for intervals), and for each row we specify the start and end dates. Note that by keeping end_date identical to dt$end_date, each row will definitely have at least one match, namely itself (this is deliberate). This gives you the non-NA version you ask for.

[With some minor changes here, you can get the NA version, but I'll leave that to you (assuming this answer is right).]

With that, we simply find which ranges from ivals overlap with dt, for each player_id, and get the matching indices. From there it's straightforward: if the games within a player's window are non-homogeneous, hybrid_index returns the corresponding row index of dt, and we replace type at those indices with "hybrid".

# roll days = 1L
dt[, type := games][hybrid_index(dt, 1L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03      B
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09      B

# roll days = 2L
dt[, type := games][hybrid_index(dt, 2L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03 hybrid
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09 hybrid
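For the NA version mentioned earlier, one possible reading is that a row should be NA when the player has no other record inside its window, so the window matches only the row itself. The sketch below rests on that assumption, which the question does not confirm; no_history and hybrid_rows are illustrative names.

```r
library(data.table)

# Example data from the question
p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)

dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

roll_days <- 1L
ivals <- copy(dt)[, date := date - roll_days]
olaps <- foverlaps(ivals, dt, by.x = key(dt), type = "any", which = TRUE)

# Assumption: a row whose window overlaps only itself has no history -> NA
no_history <- olaps[, .N, by = xid][N == 1L, xid]

# Same hybrid detection as in hybrid_index()
olaps[, val := dt$games[xid] != dt$games[yid]]
hybrid_rows <- olaps[, any(val), by = xid][(V1), xid]

dt[, type := games][hybrid_rows, type := "hybrid"][no_history, type := NA_character_]
```

Under this reading a lone record after a long gap also becomes NA, not just a player's very first record, which may or may not be what is wanted.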

To illustrate the idea clearly, I've created a function and copied dt inside it. But you can avoid the copy by adding the dates in ivals directly to dt and making use of the by.x and by.y arguments of foverlaps(). Please look at ?foverlaps.
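A minimal sketch of that copy-free variant might look like the following. The column name win_start is an assumption introduced here for illustration; the window start is stored in dt itself, and by.x and by.y tell foverlaps() which columns define the intervals on each side.

```r
library(data.table)

# Example data from the question
p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)

dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

roll_days <- 1L
# win_start is a hypothetical column name holding the start of each window
dt[, win_start := date - roll_days]

# x side: the look-back window; y side: the actual play dates (the key of dt)
olaps <- foverlaps(dt, dt,
                   by.x = c("player_id", "win_start", "end_date"),
                   by.y = c("player_id", "date", "end_date"),
                   type = "any", which = TRUE)

olaps[, val := dt$games[xid] != dt$games[yid]]
hybrid_rows <- olaps[, any(val), by = xid][(V1), xid]
dt[, type := games][hybrid_rows, type := "hybrid"]
```

This should give the same type column as hybrid_index(dt, 1L), at the cost of one extra column in dt.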
