Creating a sequence in a data.table depending on a column


Problem description

Say I have the following data.table:

library(data.table)

DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)

Which returns something like:

       R Seq
    1: 1   0
    2: 1   0
    3: 0   0
    4: 0   0
    5: 1   0
   ---      
 9996: 1   0
 9997: 0   0
 9998: 0   0
 9999: 0   0
10000: 1   0

I want to generate a sequence (1, 2, 3,..., n) that resets whenever R changes from the previous row. Think of it like I'm counting a streak of random numbers.

So the above would then look like:

       R Seq
    1: 1   1
    2: 1   2
    3: 0   1
    4: 0   2
    5: 1   1
   ---      
 9996: 1   5
 9997: 0   1
 9998: 0   2
 9999: 0   3
10000: 1   1

Thoughts?

Solution

Here is an option:

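set.seed(1)  # make the random draw reproducible
DT <- data.table(R = sample(0:1, 10000, replace = TRUE), Seq = 0L)
# Number the rows within each run of identical consecutive R values
DT[, Seq := seq(.N), by = list(cumsum(c(0, abs(diff(R)))))]
DT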

We create a counter that increments every time your 0-1 variable changes, using cumsum(abs(diff(R))). The c(0, ...) part pads the front of the vector, because diff() returns one element fewer than R has. Then we group by that counter with by. This produces:

       R Seq
    1: 0   1
    2: 0   2
    3: 1   1
    4: 1   2
    5: 0   1
   ---      
 9996: 1   1
 9997: 0   1
 9998: 1   1
 9999: 1   2
10000: 1   3
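
For comparison, here is a minimal sketch of the same grouping done with data.table's rleid(), which assigns one id per run of identical consecutive values. This is not the approach used in the answer above, only an alternative that should give the same result. The table DT2 and the column Seq2 are made up for illustration.

```r
library(data.table)

# Small hand-made example (hypothetical data, not the 10,000-row table above)
DT2 <- data.table(R = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L))

# Approach from the answer: group on a counter that increments at each change in R
DT2[, Seq := seq_len(.N), by = list(cumsum(c(0, abs(diff(R)))))]

# Alternative: rleid() returns a run id that changes whenever R changes
DT2[, Seq2 := seq_len(.N), by = .(run = rleid(R))]

DT2
```

Both columns should come out identical. rleid() also works when the column is not restricted to 0 and 1, whereas the diff()-based counter assumes a numeric column.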


EDIT: Addressing request for clarification:

Let's look at the computation I'm using in by, broken down into two new columns:

DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)

Produces:

       R Seq diff cumsum
    1: 0   1    0      0
    2: 0   2    0      0
    3: 1   1    1      1
    4: 1   2    0      1
    5: 0   1   -1      2
    6: 1   1    1      3
    7: 1   2    0      3
    8: 1   3    0      3
    9: 1   4    0      3
   10: 0   1   -1      4
   ---                  
 9991: 1   2    0   5021
 9992: 1   3    0   5021
 9993: 1   4    0   5021
 9994: 1   5    0   5021
 9995: 0   1   -1   5022
 9996: 1   1    1   5023
 9997: 0   1   -1   5024
 9998: 1   1    1   5025
 9999: 1   2    0   5025
10000: 1   3    0   5025

You can see how the cumulative sum of the absolute value of diff increments by one each time R changes. We can then use that cumsum column to break the data.table into chunks and, for each chunk, generate a sequence with seq(.N) that counts up to the number of items in the chunk (.N is exactly that: the number of rows in each by group).
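
The same steps can be traced by hand on a short vector. The sketch below uses base R only and takes the first ten values of R from the printed output above. The names step and chunk are chosen here for illustration, and ave() stands in for the grouped seq(.N) call.

```r
# First ten values of R from the printed output above
R <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 0)

step  <- c(0, diff(R))      # 0 where R repeats, +1 or -1 where it changes
chunk <- cumsum(abs(step))  # chunk id that goes up by one at each change

# Count positions within each chunk; ave() applies FUN separately per chunk
Seq <- ave(R, chunk, FUN = seq_along)

data.frame(R, step, chunk, Seq)
```

The step, chunk and Seq columns should match the diff, cumsum and Seq columns in rows 1 to 10 of the table above.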
