Create new column based on condition that exists within a rolling date


Question

To make this question more general, I believe it could also be phrased as: creating a rolling, time-sensitive factor variable. Though an uncommon requirement, this could be useful for many different data sources.

I have a series of irregularly spaced time data, often with more than one record per day, for thousands of users. I want to create a new column player_type that keeps track of a rolling 30-day definition of their behavior. The behavior is defined by which games they play; the column 'games' is a factor with levels gameA and gameB.

There are thus three types of behavior:

  1. Exclusively plays GameA - 'A'
  2. Exclusively plays GameB - 'B'
  3. Plays both games - 'Hybrid'

I want to use this new column to see how each player's behavior changes over time, and to count the number of players in each group over time to see how the groups change.

The time series is highly irregular for each player. Players can play multiple types of games per day, or not play any games for many months. A record is only created when the player plays a game, so I expect a solution might use a filter something like:

interval(current_date, current_date - new_period(days = 30)) (using lubridate).
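As a rough illustration of that kind of filter, here is a minimal sketch in base R rather than lubridate (so it does not rely on new_period()). The helper name window_rows and the toy data are assumptions made for this example only; it keeps one player's rows that fall inside the closed window [current_date - days, current_date].

```r
# Hypothetical helper (not from the original post): rows of one player
# inside the closed window [current_date - days, current_date].
# Subtracting a number from a base R Date moves it by that many days.
window_rows <- function(df, player, current_date, days = 30) {
  in_window <- df$player_id == player &
    df$date >= current_date - days &
    df$date <= current_date
  df[in_window, ]
}

# Toy data, invented for this illustration.
df <- data.frame(
  player_id = c(1, 1, 1, 2),
  date = as.Date(c("2014-09-01", "2014-10-05", "2014-10-20", "2014-10-20")),
  games = c("A", "B", "B", "A")
)

recent <- window_rows(df, player = 1, current_date = as.Date("2014-10-20"))
print(recent)
```

Only the two October records of player 1 survive the filter; the September record is more than 30 days old.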

Here is an example data set. Keep in mind that it is simplified and tests a rolling 1-day change, so simple methods that only check the previous record will not actually work on the real data. If you are able to make a better data set, please advise and I will edit this post.

p <- c( 1,   1,   1,   2,   2,   2,   6,   6,   6)
g <- c('A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B')
d <- seq(as.Date('2014-10-01'), as.Date('2014-10-09'), by=1)
df <- data.frame(player_id = p, date = d, games = g)

I need:

 player_id       date games   type
1         1 2014-10-01     A      A (OR NA)
2         1 2014-10-02     B Hybrid
3         1 2014-10-03     B      B
4         2 2014-10-04     A      A (OR NA)
5         2 2014-10-05     B Hybrid
6         2 2014-10-06     A Hybrid
7         6 2014-10-07     A      A (OR NA)
8         6 2014-10-08     B Hybrid
9         6 2014-10-09     B      B

The solution should be something like: apply over the rows a function that looks back 30 days in time, with an ifelse() statement to see which games the player played in that window.
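The approach described above can be sketched in base R as follows. This is only an illustration under assumptions: classify_row is a hypothetical helper name, the window is taken as the closed interval [date - roll_days, date], and the loop is quadratic in the number of rows, so it is meant to pin down the desired result rather than to scale to thousands of users.

```r
p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)
df <- data.frame(player_id = p, date = d, games = g, stringsAsFactors = FALSE)

# Hypothetical helper: classify row i from the games the same player
# logged in the closed window [date_i - roll_days, date_i].
classify_row <- function(i, df, roll_days) {
  in_window <- df$player_id == df$player_id[i] &
    df$date >= df$date[i] - roll_days &
    df$date <= df$date[i]
  played <- unique(df$games[in_window])
  # More than one distinct game in the window means "Hybrid";
  # otherwise the single game played is the type.
  ifelse(length(played) > 1, "Hybrid", played[1])
}

df$type <- sapply(seq_len(nrow(df)), classify_row, df = df, roll_days = 1)
print(df)
```

With roll_days = 1 this reproduces the non-NA variant of the table above; for the real data the window would be 30.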

This is a very similar post and should help solve this problem: "How do I do a conditional sum which only looks between certain date criteria".

I have also explored rowwise() and conditional mutate() calls using dplyr; however, the catch for me is the historical time component.

Thanks for all the help! I can't thank this forum enough. I'll be checking back frequently.

Recommended answer

Assuming that I understood it right, here's a data.table way using the foverlaps() function.

Create dt and set the key as shown below:

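library(data.table)

# Each record is a one-day interval: it starts and ends on its own date.
dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

# Returns the row indices of dt whose look-back window holds a different game.
hybrid_index <- function(dt, roll_days) {
    # Look-back intervals: [date - roll_days, end_date] for every row.
    ivals = copy(dt)[, date := date-roll_days]
    # by.x is spelled out because := on a key column alters the key of ivals.
    olaps = foverlaps(ivals, dt, by.x=key(dt), type="any", which=TRUE)
    olaps[, val := dt$games[xid] != dt$games[yid]]
    olaps[, any(val), by=xid][(V1), xid]
}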

We create a dummy data.table ivals (for intervals) in which, for each row, we specify the start and end dates of the look-back window. Note that because end_date is identical to dt$end_date, every row is guaranteed at least one match, namely itself (this is deliberate); this gives you the non-NA version you ask for.

[With some minor changes here, you can get the NA version, but I'll leave that to you (assuming this answer is right).]
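One possible reading of those "minor changes", offered only as a sketch: drop each row's match with itself, so that a record with no other record from the same player inside the window ends up as NA. The function name typed_with_na and this interpretation of the "NA version" are assumptions, not something the answer spells out.

```r
library(data.table)

p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)
dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

# Assumption: "NA version" means a record gets NA when the same player has
# no other record inside the window [date - roll_days, date].
typed_with_na <- function(dt, roll_days) {
  ivals <- copy(dt)[, date := date - roll_days]
  olaps <- foverlaps(ivals, dt, by.x = c("player_id", "date", "end_date"),
                     type = "any", which = TRUE)
  # ivals keeps the row order of dt, so xid == yid is a row matching itself.
  olaps <- olaps[xid != yid]
  olaps[, differs := dt$games[xid] != dt$games[yid]]
  res <- olaps[, .(hybrid = any(differs)), by = xid]
  type <- rep(NA_character_, nrow(dt))
  type[res$xid] <- ifelse(res$hybrid, "hybrid", dt$games[res$xid])
  type
}

dt[, type := typed_with_na(dt, 1L)]
print(dt)
```

On the sample data with roll_days = 1, the first record of each player becomes NA and the remaining rows match the non-NA result.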

With that, we simply find which ranges from ivals overlap with dt, for each player_id, and get the matching indices. From there it's straightforward: if a player's games within the window are non-homogeneous, hybrid_index returns the corresponding row index of dt, and we replace type at those indices with "hybrid".
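To make the matching indices concrete, the following sketch prints the overlap pairs for a one-day window on the sample data. The columns x_game and y_game are added purely for illustration and are not part of the answer's code.

```r
library(data.table)

p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)
dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

# Look-back intervals for a one-day window.
ivals <- copy(dt)[, date := date - 1L]

# which = TRUE returns index pairs: xid is a row of ivals, yid a row of dt.
olaps <- foverlaps(ivals, dt, by.x = c("player_id", "date", "end_date"),
                   type = "any", which = TRUE)

# Illustrative columns: the game of the current row and of each match.
olaps[, `:=`(x_game = dt$games[xid], y_game = dt$games[yid])]
print(olaps)
```

Row 2 (player 1 on 2014-10-02) pairs with rows 1 and 2 of dt, whose games are A and B, which is why it is flagged as hybrid; row 1 pairs only with itself.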

# roll days = 1L
dt[, type := games][hybrid_index(dt, 1L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03      B
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09      B

# roll days = 2L
dt[, type := games][hybrid_index(dt, 2L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03 hybrid
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09 hybrid
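Once type exists, the per-group player counts that the question asks about can be tabulated. The sketch below is an assumption about how one might report them: it copies the typed records from the roll days = 1 output and counts distinct players per calendar month and type. The month column and the choice of reporting period are hypothetical.

```r
library(data.table)

# Typed records copied from the roll days = 1 output above.
typed <- data.table(
  player_id = c(1, 1, 1, 2, 2, 2, 6, 6, 6),
  date = seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1),
  type = c("A", "hybrid", "B", "A", "hybrid", "hybrid", "A", "hybrid", "B")
)

# Hypothetical reporting period: calendar month. Count the distinct players
# that carried each type at least once within the month.
typed[, month := format(date, "%Y-%m")]
counts <- typed[, .(players = uniqueN(player_id)), by = .(month, type)]
print(counts)
```

A player can appear under several types in the same month, since the type is attached to each record rather than to the player as a whole.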

To illustrate the idea clearly, I've created a function and copied dt inside the function. But you can avoid that by adding the dates in ivals directly to dt and making use of the by.x and by.y arguments of foverlaps(). Please look at ?foverlaps.
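A sketch of that variant, under assumptions: the window start is stored in a new column (start_date is a hypothetical name) and dt is overlapped with itself, with by.x pointing at the window columns and by.y at the key of dt.

```r
library(data.table)

p <- c(1, 1, 1, 2, 2, 2, 6, 6, 6)
g <- c("A", "B", "B", "A", "B", "A", "A", "B", "B")
d <- seq(as.Date("2014-10-01"), as.Date("2014-10-09"), by = 1)
dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

roll_days <- 1L
# Hypothetical column name: start of the look-back window for each record.
# Adding a new column leaves the key of dt untouched.
dt[, start_date := date - roll_days]

# x intervals: [start_date, end_date]; y intervals: [date, end_date].
olaps <- foverlaps(dt, dt,
                   by.x = c("player_id", "start_date", "end_date"),
                   by.y = c("player_id", "date", "end_date"),
                   type = "any", which = TRUE)
olaps[, differs := dt$games[xid] != dt$games[yid]]
hybrid_rows <- olaps[, any(differs), by = xid][(V1), xid]

dt[, type := games][hybrid_rows, type := "hybrid"]
print(dt)
```

This should give the same type column as the roll days = 1 result above, without building a second copy of the whole table.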
