findInterval() with varying intervals in data.table R


Question

I asked this question a long time ago but have not found an answer yet. I do not know whether reposting is acceptable on Stack Overflow, but I am reposting it.

I have a data.table in R, and I want to create a new column that gives, for every price, the interval it falls into, where the interval breakpoints differ for each year/month.

Reproducible example:

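library(data.table)

set.seed(100)
# 3380 prices; year and month (length 10 each) are recycled to match
DT <- data.table(year=2000:2009, month=1:10,  price=runif(5*26^2)*100)
# NOTE: this list is overwritten on the next line; the call is kept only
# because runif() advances the RNG stream used by sample() below
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
# 100 rows of 10 breakpoints each, sorted within every row
intervals <- replicate(10, (sample(10:100,100, replace=TRUE)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
# one row per (year, month) pair; month is repeated explicitly instead of relying on recycling
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), rep(1:10, times=10))]
setkey(intervals.dt, year, month)
setkey(DT, year, month)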

I have tried:

  • merging the DT and intervals.dt data.tables by month/year,
  • creating a new intervalsstring column that pastes all the V* columns into a single string (not very elegant, I admit), and finally
  • splitting that string back into a vector so that I can use it in findInterval(), but this solution does not work for every row (!)

So, after:

DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]

I get:

> DT
      year month     price               intervalsstring interval
   1: 2000     1 30.776611 12 21 36 46 48 51 63 72 91 95        2
   2: 2000     1 62.499648 12 21 36 46 48 51 63 72 91 95        6
   3: 2000     1 53.581115 12 21 36 46 48 51 63 72 91 95        6
   4: 2000     1 48.830599 12 21 36 46 48 51 63 72 91 95        5
   5: 2000     1 33.066053 12 21 36 46 48 51 63 72 91 95        2
---                                                            
3376: 2009    10 33.635924 12 40 45 48 50 65 75 90 96 97        2
3377: 2009    10 38.993769 12 40 45 48 50 65 75 90 96 97        3
3378: 2009    10 75.065820 12 40 45 48 50 65 75 90 96 97        8
3379: 2009    10  6.277403 12 40 45 48 50 65 75 90 96 97        0
3380: 2009    10 64.189162 12 40 45 48 50 65 75 90 96 97        7

which is correct for the first rows, but not for the last (or other) rows. For example, for row 3380, the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that, in my last command, findInterval() relies only on the first row of intervalsstring.

Thanks!
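
The suspicion about the first row can be checked in isolation. The following is a minimal base R sketch, not part of the original post: the two break strings and prices are copied from the first and last groups in the output above, and the variable names (first_only, breaks_by_row, wrong, right) are made up for illustration.

```r
# Break strings and prices taken from rows 2 and 3380 of the output above.
intervalsstring <- c("12 21 36 46 48 51 63 72 91 95",
                     "12 40 45 48 50 65 75 90 96 97")
price <- c(62.499648, 64.189162)

# strsplit() returns one list element per row; [[1]] keeps only the
# breaks of the first row, which are then applied to every price.
first_only <- as.numeric(strsplit(intervalsstring, " ")[[1]])
wrong <- findInterval(price, first_only)

# Row-wise alternative: pair each price with the breaks from its own row.
breaks_by_row <- lapply(strsplit(intervalsstring, " "), as.numeric)
right <- mapply(findInterval, price, breaks_by_row)

print(wrong)  # 6 7
print(right)  # 6 5
```

The second price lands in interval 7 when measured against the first row's breaks, and in interval 5 against its own, which matches the discrepancy described for row 3380.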

Answer

Your main problem is that you didn't run findInterval for each group. I also don't see the point of making that large merged data.table, or of the paste/strsplit business. This is what I would do:

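DT[, interval := findInterval(price,
                              # breakpoints of the current (year, month) group;
                              # unlist() makes the one-row subset an explicit numeric vector
                              unlist(intervals.dt[.BY][, V1:V10, with = FALSE])),
     by = .(year, month)][]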
#      year month     price interval
#   1: 2000     1 30.776611        2
#   2: 2000     1 62.499648        6
#   3: 2000     1 53.581115        6
#   4: 2000     1 48.830599        5
#   5: 2000     1 33.066053        2
#  ---                              
#3376: 2009    10 33.635924        1
#3377: 2009    10 38.993769        1
#3378: 2009    10 75.065820        7
#3379: 2009    10  6.277403        0
#3380: 2009    10 64.189162        5

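Note that intervals.dt[.BY] is a keyed subset: .BY holds the current group's year and month, and because intervals.dt is keyed on those two columns, the lookup returns that group's row of breakpoints.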
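
To see the keyed lookup on something small enough to check by hand, here is a sketch on toy data. The tables prices and breaks and the column grp are hypothetical and not from the question; only findInterval(), setkey(), .BY and the by-group `:=` pattern come from the answer above.

```r
library(data.table)

# Hypothetical toy data: two groups, each with its own sorted breakpoints.
prices <- data.table(grp = c("a", "a", "b", "b"), price = c(5, 25, 5, 25))
breaks <- data.table(grp = c("a", "b"),
                     V1 = c(0, 10), V2 = c(10, 20), V3 = c(20, 30))
setkey(breaks, grp)

# Within each group, .BY is a list holding that group's grp value.
# breaks[.BY] joins on the key, so each group is compared only against
# its own row of breakpoints.
prices[, interval := findInterval(price,
                                  unlist(breaks[.BY][, V1:V3, with = FALSE])),
       by = grp]

print(prices)
```

Group "a" uses the breaks 0, 10, 20 and group "b" uses 10, 20, 30, so the same two prices get different interval numbers depending on their group.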
