Finding max of data.table column within range of rows from current observation by group

Question

OK, so that title is quite a mouthful, but here's the problem I solved, and I'm curious whether anyone has a better solution or can generalize it further.

I have a time series as a data.table, and I want to find out whether an observation "bucks the trend", so to speak, of the data before and after it. That is, is this observation larger than every observation in the year before and the year after?

To do this, my idea was to add another column that holds the max over the rows above and below, and then simply check whether a row's value equals that max.

Luckily, my data is regularly spaced, meaning that every row is the same distance in time from its neighboring rows. I use this fact to specify the window size manually as a number of rows, rather than having to check whether each row falls within the time distance of interest.

#######################
# Package Loading
usePackage <- function(p) {
  if (!is.element(p, installed.packages()[,1]))
    install.packages(p, dep = TRUE)
  require(p, character.only = TRUE)
}

packages <- c("data.table","lubridate")
for(package in packages) usePackage(package)
rm(packages,usePackage)
#######################

set.seed(1337)

# creating a data.table
mydt <- data.table(Name = c(rep("Roger",12),rep("Johnny",8),"Mark"),
                   Date = c(seq(ymd('2010-06-15'),ymd('2015-12-15'), by = '6 month'),
                            seq(ymd('2012-06-15'),ymd('2015-12-15'), by = '6 month'),
                            ymd('2015-12-15')))

mydt[ , Value := c(rnorm(12,15,1),rnorm(8,30,2),rnorm(1,100,30))]
setkey(mydt, Name, Date)

# setting the number of rows up or down to check
windowSize <- 2

# applying the windowing max function
mydt[,
     windowMax := unlist(lapply(1:.N, function(x) max(.SD[Filter(function(y) y>0 & y <= .N, unique(abs(x+(-windowSize:windowSize)))), Value]))),
     by = Name]

# checking if a value is the local max (by window)
mydt[, isMaxValue := windowMax == Value]
mydt

As you can see, the windowing function is a mess, but it does the trick. My questions are: do you know a simpler, more succinct, or more readable way to do the same thing? And do you know how to generalize this to irregular time series (i.e. a window defined by time rather than by a fixed number of rows)? I couldn't get zoo::rollapply to do what I wanted, but I don't have much experience with it (I couldn't get around a group with only 1 row causing the function to crash).

Let me know your thoughts, and thank you!

Answer

This doesn't really address the time-window part, but if you want a one-liner with zoo::rollapply, you can do:

library(zoo) # provides rollapply

width <- 2 * windowSize + 1 # one central obs. plus windowSize (here two) on each side

mydt[, isMaxValue2 := rollapply(Value, width, max, partial = TRUE) == Value, by = Name]
identical(mydt$isMaxValue, mydt$isMaxValue2) # TRUE

It's somewhat more legible than your proposed solution, I think.

The partial = TRUE argument deals with the "border effects" that occur when the window holds fewer than width (here 5) observations, i.e. near the start or end of a group.
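To make the border handling concrete, here is a small sketch on a made-up vector. The values, the width of 3, and the variable names are illustrative assumptions, not taken from the data above. With partial = TRUE, rollapply uses whatever neighbours are available at the edges instead of dropping those positions, which is also why a one-row group goes through.

```r
library(zoo)

x <- c(3, 1, 2, 5, 4)  # made-up values for illustration

# Centered window of width 3; at the edges only the available neighbours are used
edgeMax <- rollapply(x, 3, max, partial = TRUE)
edgeMax  # same length as x

# Without partial = TRUE, only positions with a full window are returned
fullMax <- rollapply(x, 3, max)
length(fullMax)  # shorter than length(x)

# A single observation, like a group with one row
oneMax <- rollapply(7, 3, max, partial = TRUE)
oneMax
```

Because the partial result has the same length as the input, it can be compared element-wise with Value inside a grouped := call, as in the one-liner above.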

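For the time-window part that the one-liner leaves open, one possible approach is to compare dates directly within each group instead of counting rows. The following is only a minimal sketch under assumptions: the table dt, its values, the cutoff windowDays, and the new column names are all hypothetical, and the per-row scan is quadratic in group size, so it is suited to small groups rather than large data.

```r
library(data.table)

# Hypothetical, irregularly spaced data (not the data from the question)
dt <- data.table(
  Name  = c("A", "A", "A", "A", "B"),
  Date  = as.Date(c("2020-01-01", "2020-03-01", "2021-06-01",
                    "2021-07-15", "2020-05-05")),
  Value = c(10, 12, 9, 11, 50)
)

windowDays <- 365  # assumed half-width of the time window, in days

# For each row, take the max Value over all rows of the same group whose
# Date lies within windowDays of that row's Date (the row itself included)
dt[, windowMaxT := vapply(seq_len(.N), function(i) {
       daysApart <- abs(as.numeric(Date) - as.numeric(Date[i]))
       max(Value[daysApart <= windowDays])
     }, numeric(1)),
   by = Name]

# A row is a local max if it equals the max of its own time window
dt[, isMaxValueT := windowMaxT == Value]
dt
```

Since every row is always inside its own window, max never sees an empty vector, so a group with a single row needs no special handling.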