Set column value using the first next row in the same group that meets a condition


Question

I am new to R and this is my first question on stackoverflow.

I am trying:

  • to assign by reference to a new column
  • for each row
  • using the value from the first next row within the same group of rows
  • that meets a condition.

Sample data:

    id code  date_down    date_up
 1:  1    p 2019-01-01 2019-01-02
 2:  1    f 2019-01-02 2019-01-03
 3:  2    f 2019-01-02 2019-01-02
 4:  2    p 2019-01-03       <NA>
 5:  3    p 2019-01-04       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05
 7:  5    f 2019-01-07 2019-01-08
 8:  5    p 2019-01-07 2019-01-08
 9:  5    p 2019-01-09 2019-01-09
10:  6    f 2019-01-10 2019-01-10
11:  6    p 2019-01-10 2019-01-10
12:  6    p 2019-01-10 2019-01-11

What I would like to do is:

  • subset (group) by id
  • and for each row
  • find date_up for the first row further down,
  • where code = 'p' and date_up (of the row found) is greater than date_down for the row I am updating.

My expected result would be:

    id code  date_down    date_up  founddate
 1:  1    p 2019-01-01 2019-01-02       <NA>
 2:  1    f 2019-01-02 2019-01-03       <NA>
 3:  2    f 2019-01-02 2019-01-02       <NA>
 4:  2    p 2019-01-03       <NA>       <NA>
 5:  3    p 2019-01-04       <NA>       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05       <NA>
 7:  5    f 2019-01-07 2019-01-08 2019-01-08
 8:  5    p 2019-01-07 2019-01-08 2019-01-09
 9:  5    p 2019-01-09 2019-01-09       <NA>
10:  6    f 2019-01-10 2019-01-10 2019-01-11
11:  6    p 2019-01-10 2019-01-10 2019-01-11
12:  6    p 2019-01-10 2019-01-11       <NA>

I have tried many variants, using .SD, .N, and creating a new column with DT[, idcount := seq_len(.N), by = id], but have not really got anywhere. Any help greatly appreciated.
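
A minimal runnable sketch of that row-numbering attempt on toy data (not the table above); it numbers rows within each id group but does not by itself look ahead to later qualifying rows:

    library(data.table)

    # Number the rows within each id group (the attempt described above).
    DT <- data.table(id = c(1, 1, 2), code = c("p", "f", "p"))
    DT[, idcount := seq_len(.N), by = id]   # idcount becomes 1,2 for id 1 and 1 for id 2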

Also, any good references to data.table would be appreciated :) Many thanks.

Edit:
I have edited the original data provided to give a more subtle example, where row 10 is updated using data from row 12, because row 12 is within the id subset and meets the qualifying criteria. Row 11 does not meet the qualifying criteria, so its data is not used to update row 10. Have also included my first use of dput.

Sample data as dput code:

dt <- structure(list(
id        = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
code      = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
class     = c("data.table", "data.frame"),
row.names = c(NA, -12L))
setDT(dt)  # to reinit the internal self ref pointer (known issue)


Answer

Join a data.table to a subset of itself, by group, to get values from rows matching non-equal criteria.

Summary:

  • Below I show 5 working data.table solutions that were candidates for performance testing against the OP's actual data set (1.4M records).
  • All 5 solutions use "non-equi" joins (using inequality to compare columns for the join) in the on clause.
  • Each solution is just a small progressive code change, so it should be easy to follow along and compare the different data.table options and syntax choices.

To work through the data.table syntax for this, I broke the OP's problem into the following steps:


1. Join the dt to a subset of itself (or another data.table, for that matter).
2. Select (and rename) the columns you want from either dt or the subset.
3. Define the join criteria based on columns from dt compared to columns in the subset, including using "non-equi" (non-equal) comparisons.
4. Optionally, define whether the first or last match should be selected when multiple matching records are found in the subset.



Solution 1:

      # Add row numbers to all records in dt (only because you 
      # have criteria based on comparing sequential rows):
      dt[, row := .I] 
      
      # Compute result columns (  then standard assignment into dt using <-  )
      dt$found_date  <- 
                  dt[code=='p'][dt,   # join dt to the data.table matching your criteria, in this case dt[code=='p']
                                .( x.date_up ),   # columns to select, x. prefix means columns from dt[code=='p'] 
                                on = .(id==id, row > row, date_up > date_down),   # join criteria: dt[code=='p'] fields on LHS, main dt fields on RHS
                                mult = "first"]   # get only the first match if multiple matches
      


Note, in the join expression above:


  • i in this case is your main dt. This way you get all records from your main data.table.
  • x is the subset (or any other data.table) from which you want to find matching values.
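
A tiny standalone illustration of that x/i split, with made-up tables A and B (not the OP's data): B plays the role of the filtered subset (x), A plays the role of i, and the x. prefix pulls the matched value out of B:

    library(data.table)

    A <- data.table(id = c(1, 1, 2),
                    d  = as.IDate(c("2019-01-01", "2019-01-03", "2019-01-02")))
    B <- data.table(id  = c(1, 2),
                    d   = as.IDate(c("2019-01-02", "2019-01-01")),
                    val = c("a", "b"))

    # One result row per row of A (the i table); x.val comes from B (the x table).
    # In the on clause, B's columns are on the left and A's columns are on the right.
    B[A, .(id = i.id, a_date = i.d, b_val = x.val),
      on = .(id == id, d > d),
      mult = "first"]
    # b_val is "a" for A's first row and NA for the others (no qualifying B row).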

The result matches the requested output:

      dt
      
          id code  date_down    date_up row found_date
       1:  1    p 2019-01-01 2019-01-02   1       <NA>
       2:  1    f 2019-01-02 2019-01-03   2       <NA>
       3:  2    f 2019-01-02 2019-01-02   3       <NA>
       4:  2    p 2019-01-03       <NA>   4       <NA>
       5:  3    p 2019-01-04       <NA>   5       <NA>
       6:  4 <NA> 2019-01-05 2019-01-05   6       <NA>
       7:  5    f 2019-01-07 2019-01-08   7 2019-01-08
       8:  5    p 2019-01-07 2019-01-08   8 2019-01-09
       9:  5    p 2019-01-09 2019-01-09   9       <NA>
      10:  6    f 2019-01-10 2019-01-10  10 2019-01-11
      11:  6    p 2019-01-10 2019-01-10  11 2019-01-11
      12:  6    p 2019-01-10 2019-01-11  12       <NA>
      

Note: You may remove the row column by doing dt[, row := NULL] if you like.

Solution 2:

Same logic as above to join and find the result columns, but now using assignment by reference := to create found_date in dt:

      dt[, row := .I] # add row numbers (as in all the solutions)
      
      # Compute result columns (then assign by reference into dt using := )
      
      # dt$found_date  <- 
      dt[, found_date :=   # assign by reference to dt$found_date 
                  dt[code=='p'][dt, 
                                .( x.date_up ), 
                                on = .(id==id, row > row, date_up > date_down),
                                mult = "first"]]
      

In Solution 2, the slight variation to assign our results "by reference" into dt should be more efficient than Solution 1. Solution 1 calculated the results in exactly the same way - the only difference is that Solution 1 used standard assignment <- to create dt$found_date (less efficient).

Solution 3:

Like Solution 2, but now using .(.SD) in place of dt to refer to the original dt without naming it directly:

      dt[, row := .I] # add row numbers (as in all the solutions)
      setkey(dt, id, row, date_down)  #set key for dt 
      
      # For all rows of dt, create found_date by reference :=
      dt[, found_date := 
                  # dt[code=='p'][dt, 
                  dt[code=='p'][.(.SD),   # our subset (or another data.table), joined to .SD (referring to original dt)
                                .( x.date_up ), 
                                on = .(id==id, row > row, date_up > date_down),  
                                mult = "first"] ]  
      

.SD above refers back to the original dt that we are assigning into. It corresponds to the subset of the data.table containing the rows selected in the first dt[, which is all the rows because we didn't filter it.
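
A quick standalone check of what .SD contains (toy data, not the OP's): without by it holds every row of the table, and with by it holds each group's rows:

    library(data.table)

    d <- data.table(id = c(1, 1, 2), v = 1:3)
    d[, nrow(.SD)]                       # 3: no grouping, so .SD is all rows
    d[, .(rows = nrow(.SD)), by = id]    # 2 rows for id 1, 1 row for id 2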

Note: In Solution 3 I used setkey() to set the key. I should have done that in Solutions 1 and 2 as well - however, I didn't want to change those solutions after @OllieB tested them successfully.

Solution 4:

Like Solution 3, but using .SD once more than previously. Our main data.table name dt now appears only once across the entire expression!

      # add row column and setkey() as previous solutions
      
      dt[, found_date :=
                  # dt[code=='p'][.(.SD), 
                  .SD[code=='p'][.SD,   # .SD in place of dt at left!  Also, removed .() at right (not sure on this second change)
                                 .(found_date = x.date_up),
                                 on = .(id==id, row > row, date_up > date_down),
                                 mult = "first"]]
      

With the change above, our data.table name dt appears only once. I like that a lot because it makes the expression easy to copy, adapt and reuse elsewhere.

Also note: where I'd previously used .(.SD), I've now removed the .() around .SD because it doesn't appear to be required. However, for that change I'm not sure whether it has any performance benefit or whether it's the preferred data.table syntax. I would be grateful if anyone can add a comment to advise on that point.

Solution 5:

Like the previous solutions, but making use of by to explicitly group the subsets over which the join operates:

      # add row column and setkey() as previous solutions
      
      dt[, found_date :=
             .SD[code=='p'][.SD,
                            .(found_date = x.date_up),
                            # on = .(id==id, row > row, date_up > date_down),
                            on = .(row > row, date_up > date_down),  # removed the id column from here
                            mult = "first"]
         , by = id]   # added by = id to group the .SD subsets 
      

In this last solution I changed it to use the by clause to explicitly group the .SD subsets on id.

Note: Solution 5 did not perform well against OllieB's actual data compared to Solutions 1 - 4. However, testing on my own mock data I found that Solution 5 could perform well when the number of unique groups in the id column was low:
- With only 6 groups in 1.5M records this solution worked just as fast as the others.
- With 40k groups in 1.5M records I saw poor performance similar to what OllieB reported.
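
A rough sketch of that kind of mock-data test (this is not OllieB's benchmark; the group count n_ids is the assumption being varied, and the join being timed follows the Solution 2 pattern):

    library(data.table)
    set.seed(42)

    n     <- 1.5e6
    n_ids <- 40000                      # vary this: e.g. 6 vs 40000 groups
    mock_dt <- data.table(
      id        = sample(n_ids, n, replace = TRUE),
      code      = sample(c("p", "f"), n, replace = TRUE),
      date_down = as.IDate("2019-01-01") + sample(0:365, n, replace = TRUE)
    )
    mock_dt[, date_up := date_down + sample(0:3, n, replace = TRUE)]
    mock_dt[, row := .I]

    # Time the Solution 2 style non-equi join on the mock data.
    system.time(
      mock_dt[, found_date :=
                mock_dt[code == "p"][mock_dt, x.date_up,
                                     on = .(id == id, row > row, date_up > date_down),
                                     mult = "first"]]
    )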


  • For the 1.45M records in OllieB's actual data, each of Solutions 1 to 4 took 2.42 seconds or less "elapsed" time according to OllieB's feedback. Solution 3 worked fastest for OllieB, with "elapsed = 1.22" seconds.

I personally prefer Solution 4 because of the simpler syntax.


  • Solution 5 (using the by clause) performed poorly, taking 577 seconds in OllieB's testing on his real data.

data.table version: 1.12.0

R version 3.5.3 (2019-03-11)


  • Changing the date fields to integer may help the join run more efficiently. See as.IDate() for converting dates to integers in data.tables.
  • The setkey() step may no longer be needed: as explained here by @Arun, on invokes [often] more efficient secondary indices and auto indexing.
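
A small sketch of the as.IDate() suggestion (column names follow the example data above; whether it helps in practice would need measuring). IDate stores dates as integers, which can make the join comparisons cheaper:

    library(data.table)

    # Convert the Date columns to integer-backed IDate in place.
    dt[, `:=`(date_down = as.IDate(date_down),
              date_up   = as.IDate(date_up))]
    str(dt[, .(date_down, date_up)])   # both columns are now IDate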

As part of your question you've asked for "any good references to data.table". I've found the following helpful; in particular, for this problem it's worth reading:


  • What does .SD stand for in data.table in R (https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r)
  • The HTML vignette for Secondary indices and auto indexing

Importantly, note this answer by @Arun, which explains "the reason for implementing the on= argument" and suggests it may no longer be necessary to set keys any more:


      因此,必须弄清楚是否花费在$上b $ b重新排序整个data.table值得花时间进行
      高速缓存有效的连接/聚合。通常,除非对相同的键控
      data.table执行重复的
      分组/联接操作,否则不会出现明显的差异。

      It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.


      因此,在大多数情况下,不需要再设置键
      了。我们建议尽可能使用on =,除非设置键
      可以显着改善您想利用的性能。

      In most cases therefore, there shouldn't be a need to set keys any more. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.





  • This SO question appears to be a hub of information about the different data.table joins: How to join (merge) data frames (inner, outer, left, right).

Finally, the data.table cheat sheet is an excellent reference (from a link found on the data.table Getting started wiki on GitHub).

As always, I'm grateful for any suggestions, as perhaps this can be improved further.

Please feel free to comment, correct or post other solutions if you can add anything.
