Data Tables: Creating New Column By Examining Multiple Columns On Multiple Rows

Question

I have a data.table with two columns, 'id' and 'prevId'. I would like to build a chain of ids by starting at one row, finding the row whose 'id' equals that row's 'prevId', and repeating this process until 'prevId' is blank.

I would then like to identify the 'id' at the end of the chain, the one without a 'prevId', and create a new column listing this originator 'id' for every row.

Below is an example of my desired output:

    id prevId originatorId createdAt
1: 11a                 11a 2018/1/12
2: 11b    11a          11a 2018/1/13
3: 11c    11b          11a 2018/1/14
4: 12a                 12a 2018/1/12
5: 12b    12a          12a 2018/1/13

Any guidance would be appreciated, thanks!

Edit:

I have noticed a caveat while testing some of the proposed solutions. An element may have a 'prevId' that does not appear as an 'id' in any other element. If I were to run chinsoon's solution:

DT[, originatorId:=id[1L], by=cumsum(prevId=="")]

it would cause such elements to list the nearest preceding row without a 'prevId' as their originator, even if that row is not part of the id-prevId chain. An example:

    id prevId originatorId createdAt
1: 10a                 10a 2018/1/12
2: 11b    11a          10a 2018/1/13
3: 11c    11b          10a 2018/1/14

In this instance, '11a' should be the originator for the 2nd and 3rd elements, but because no row has that 'id', '10a' is assigned instead. Is there a tweak that could fix this? Much appreciated.
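
For reference, the caveat can be reproduced with a small table. This is only a sketch: the 'createdAt' column is left out for brevity, and a blank 'prevId' is assumed to be stored as an empty string, as chinsoon's solution implies.

```r
library(data.table)

# Assumed sample data: "11a" is referenced as a prevId but has no row of its own
DT <- data.table(id     = c("10a", "11b", "11c"),
                 prevId = c("",    "11a", "11b"))

# chinsoon's solution: a new group starts at every row with a blank prevId,
# and each group takes its first id as the originator
DT[, originatorId := id[1L], by = cumsum(prevId == "")]

# All three rows land in the same group, so every originatorId is "10a"
print(DT)
```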

Recommended answer

This version now follows the chain for two steps, using dplyr and data.table functionality.

library(data.table)
library(dplyr)

dt <- data.table(id = c("11a", "11b", "11c", "12a", "12b"),
                 prevId = c(NA, "11a", "11b", NA, "12a"))

# Join dt to itself twice, then take the furthest ancestor found for each row:
# prevId.y (two steps back) if present, else prevId (one step back), else id
data.table(left_join(x = dt
       , y = dt[,.(prevId)]
       , by = c("id" = "prevId")) %>% left_join(
                                            y = dt[,.(id,prevId)]
                                            , by = c("prevId" = "id")
       ))[, .(id, prevId, originatorId = ifelse(is.na(prevId.y), ifelse(is.na(prevId), id, prevId), prevId.y ))]

    id prevId originatorId
1: 11a   <NA>          11a
2: 11b    11a          11a
3: 11c    11b          11a
4: 12a   <NA>          12a
5: 12b    12a          12a
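
The nested ifelse() above picks the first non-missing value among prevId.y, prevId and id. As a sketch rather than part of the original answer, the same two-step lookup can also be written with base R's match() and data.table's fcoalesce(). The variable name grand is illustrative, and the code assumes a data.table version that provides fcoalesce().

```r
library(data.table)

dt <- data.table(id     = c("11a", "11b", "11c", "12a", "12b"),
                 prevId = c(NA, "11a", "11b", NA, "12a"))

# Two steps back: the prevId of the row whose id equals this row's prevId.
# match() returns NA when no such row exists, which propagates as NA here.
grand <- dt$prevId[match(dt$prevId, dt$id)]

# First non-missing value of: two steps back, one step back, the row's own id
dt[, originatorId := fcoalesce(grand, prevId, id)]
print(dt)
```

Like the join-based version, this only looks two steps back, so longer chains would need further lookups.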

Expanded the example to incorporate the comment by @Michael. The approach is fairly scalable: the number of steps followed can be adjusted by adding further joins to the pipe. The joined data.table is saved after each step, which makes the matching easy to follow. Finally, the results of each join are combined, and the resulting table should give a good overview of the id chains in the data.

library(data.table)
library(dplyr)

# Step 1: self-join on id = prevId; the result is saved as dt.join.1
left_join(x = dt
          , y = dt[,.(prevId)]
          , by = c("id" = "prevId")) %>% data.table(.) %>% { . ->> dt.join.1 } %>%
  # Step 2: look up the prevId of each row's prevId; the result is saved as dt.join.2
  left_join(x = .
            , y = dt[,.(Second.id = id, Second.prevId = prevId)]
            , by = c("prevId" = "Second.id")) %>% data.table(.) %>% { . ->> dt.join.2 }

# Stack the per-step results; the idcol is named "step" so that it does not
# clash with the existing id column
dt.join.final.data <- rbindlist(list(  dt.join.1
                                       , dt.join.2)
                                , fill = TRUE
                                , idcol = "step"
                                , use.names = TRUE)

The resulting data.table then looks like this:

> dt.join.final.data
    step  id prevId Second.prevId
 1:    1 11a   <NA>          <NA>
 2:    1 11b    11a          <NA>
 3:    1 11c    11b          <NA>
 4:    1 12a   <NA>          <NA>
 5:    1 12b    12a          <NA>
 6:    2 11a   <NA>          <NA>
 7:    2 11b    11a          <NA>
 8:    2 11c    11b           11a
 9:    2 12a   <NA>          <NA>
10:    2 12b    12a          <NA>
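
Instead of adding one join per step, the chain can also be followed in a loop until no row can move further back. The following is a minimal sketch under stated assumptions, not the answer's method: find_originator is a hypothetical helper, ids are assumed unique and non-missing, and a missing 'prevId' may be either NA or an empty string.

```r
library(data.table)

# Hypothetical helper, not part of the original answer.
# Assumes ids are unique and non-missing; a missing prevId may be NA or "".
find_originator <- function(id, prevId) {
  parent  <- setNames(prevId, id)    # lookup table: id -> prevId
  current <- id                      # every row starts at its own id
  # A chain cannot be longer than the number of rows, so this bound also
  # guarantees termination if the data accidentally contain a cycle.
  for (i in seq_along(id)) {
    nxt  <- unname(parent[current])  # NA when `current` has no row of its own
    step <- !is.na(nxt) & nxt != ""  # rows that can still move one link back
    if (!any(step)) break
    current[step] <- nxt[step]
  }
  current
}

# The answer's sample data
dt <- data.table(id     = c("11a", "11b", "11c", "12a", "12b"),
                 prevId = c(NA, "11a", "11b", NA, "12a"))
dt[, originatorId := find_originator(id, prevId)]

# The caveat from the question: "11a" never appears as an id
DT <- data.table(id = c("10a", "11b", "11c"), prevId = c("", "11a", "11b"))
DT[, originatorId := find_originator(id, prevId)]
```

Because a row stops at the last 'prevId' that cannot be resolved further, the 2nd and 3rd rows of the caveat example get '11a' rather than '10a'.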
