根据滚动日期内存在的条件创建新列 [英] Create new column based on condition that exists within a rolling date

查看:34
本文介绍了根据滚动日期内存在的条件创建新列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为了使这个问题更笼统,我相信它也可以改写为:创建一个滚动的时间敏感因子变量.虽然这是一个不常见的要求,但这可以用于许多不同的数据源.

To make this question more generalized, I believe it could also be rephrased as: Creating a rolling temporally sensitive factor variable. Though an uncommon requirement, this could be utilized for many different data sources.

我有一系列非统一时间数据,每天有 > 1 条记录,供成千上万的用户使用.我想创建一个新列 player_type 来跟踪他们行为的滚动 30 天定义.行为由他们玩的游戏定义;'games' 列是 gameA、gameB 的一个因子.

I have a series of non-uniform time data with > 1 record per day for thousands of users. I want to create a new column player_type that keeps track of a rolling 30 day definition of their behavior. The behavior is defined by what games they play; the column 'games' is a factor of gameA, gameB.

因此存在三种类型的行为:

There are thus three types of behaviors:

  1. 独玩GameA - 'A'
  2. 独玩GameB - 'B'
  3. 玩两种游戏 - 'Hybrid'

我想使用这个新列来查看他们的游戏行为随时间的变化,以及计算每个组中的玩家数量,看看他们是如何变化的.

I want to use this new column to see the changes in their play behavior over time, as well as counting the number of players in each group throughout time, to see how they change.

每个玩家的时间序列非常不规则.玩家可以每天玩多种类型的游戏,或者几个月不玩任何游戏.每个玩家的时间序列是不规则的,因此只有在玩家玩游戏时才会创建记录,因此我希望解决方案可能会使用类似以下的过滤器:

The time series is highly irregular for each player. Players can play multiple types of games per day, or not play any games for many months. The time series is irregular per player such that a record is only created when the player plays a game, thus I expect a solution might use a filter something like:

interval(current_date, current_date - new_period(days=30)(使用 lubridate).

interval(current_date, current_date - new_period(days=30) (using lubridate).

这是一个示例数据集.请记住,这是简化并测试滚动 1 天的更改,因此之前检查记录的简单方法实际上不起作用.如果你能做出更好的数据集,请指教,我会编辑这篇文章.

Here is an example data set. Keep in mind this it is simplified and tests a rolling 1 day change, so simple methods checking the record before will not actually work. If you are able to make a better data set, please advise and I will edit this post.

p <- c( 1,   1,   1,   2,   2,   2,   6,   6,   6)

g <- c('A', 'B', 'B', 'A', 'B', 'A', 'A', 'B', 'B')

d <- seq(as.Date('2014-10-01'), as.Date('2014-10-9'), by=1)

df <- data.frame(player_id = p, date = d, games = g)

作为我需要的输出:

 player_id       date games   type
1         1 2014-10-01     A      A (OR NA)
2         1 2014-10-02     B Hybrid
3         1 2014-10-03     B      B
4         2 2014-10-04     A      A (OR NA)
5         2 2014-10-05     B Hybrid
6         2 2014-10-06     A Hybrid
7         6 2014-10-07     A      A (OR NA)
8         6 2014-10-08     B Hybrid
9         6 2014-10-09     B      B

解决方案应该类似于,apply 通过列,并应用一个检查 30 天时间的函数,以及一个 ifelse() 语句以查看什么他们玩的游戏.

The solution should be something like, apply through the columns, and apply a function which checks back 30 days in time, and an ifelse() statement to see what games they played.

这是一个非常相似的帖子 - 应该有助于解决这个问题.我该怎么做仅在特定日期条件之间查找的条件总和

This is a very similar post - and should help solve this problem. How do I do a conditional sum which only looks between certain date criteria

我也探索过,rowwise() 和条件 mutates() 使用 dplyr,但捕获是我的历史时间组件.

I have also explored, rowwise() and conditional mutates() using dplyr, however the catch is the historical time component for me.

感谢大家的帮助!我不能感谢这个论坛.我会经常回来查看的.

Thanks for all the help! I can't thank this forum enough. I'll be checking back frequently.

推荐答案

假设我理解正确,这里是一个使用 foverlaps() 函数的 data.table 方式.

Assuming that I understood it right, here's a data.table way using foverlaps() function.

创建dt并设置key如下图:

Create dt and set key as shown below:

dt <- data.table(player_id = p, games = g, date = d, end_date = d)
setkey(dt, player_id, date, end_date)

hybrid_index <- function(dt, roll_days) {
    ivals = copy(dt)[, date := date-roll_days]
    olaps = foverlaps(ivals, dt, type="any", which=TRUE)
    olaps[, val := dt$games[xid] != dt$games[yid]]
    olaps[, any(val), by=xid][(V1), xid]
}

我们创建一个虚拟 data.table ivals(用于间隔),并为每一行指定 startend 日期.请注意,通过将 end_date 指定为与​​ dt$end_date 相同,我们肯定会有一个匹配项(这是故意的) - 这将为您提供非 NA 版本要求.

We create a dummy data.table ivals (for intervals), and for each row, we specify the start and the end dates. Note that by specifying end_date identical as dt$end_date, we'll definitely have one match (and this is deliberate) - this'll give you the non-NA version you ask for.

[这里有一些小的改动,你可以获得 NA 版本,但我会留给你(假设这个答案是正确的).]

[With some minor changes here, you can get the NA version, but I'll leave that to you (assuming this answer is right).]

这样,我们只需找到 ivals 中的哪些范围与 dt 重叠,for each player_id.我们得到匹配的索引.从那里开始很简单.如果玩家的游戏是非同质的,那么我们从 hybrid_index 中返回 dt 的相应索引.我们将这些索引替换为混合".

With that we simply find which ranges from ivals overlaps with dt, for each player_id. We get the matching indices. From there it's straightforward. If a player's game is non-homogeneous, then we return the corresponding index of dt from hybrid_index. And we replace those indices with "hybrid".

# roll days = 1L
dt[, type := games][hybrid_index(dt, 1L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03      B
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09      B

# roll days = 2L
dt[, type := games][hybrid_index(dt, 2L), type := "hybrid"]
#    player_id games       date   end_date   type
# 1:         1     A 2014-10-01 2014-10-01      A
# 2:         1     B 2014-10-02 2014-10-02 hybrid
# 3:         1     B 2014-10-03 2014-10-03 hybrid
# 4:         2     A 2014-10-04 2014-10-04      A
# 5:         2     B 2014-10-05 2014-10-05 hybrid
# 6:         2     A 2014-10-06 2014-10-06 hybrid
# 7:         6     A 2014-10-07 2014-10-07      A
# 8:         6     B 2014-10-08 2014-10-08 hybrid
# 9:         6     B 2014-10-09 2014-10-09 hybrid

为了清楚地说明这个想法,我创建了一个函数并在函数内部复制了 dt.但是您可以避免这种情况,并将 ivals 中的日期直接添加到 dt 并使用 by.xby.y<foverlaps() 中的/code> 参数.请看?foverlaps.

To illustrate the idea clearly, I've created a function and copied dt inside the function. But you can avoid that and add the dates in ivals directly to dt and make use of by.x and by.y arguments in foverlaps(). Please look at ?foverlaps.

这篇关于根据滚动日期内存在的条件创建新列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆