Creating a sequence in a data.table depending on a column
Problem description
Say I have the following data.table:
    library(data.table)
    DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)
Which returns something like:
           R Seq
        1: 1   0
        2: 1   0
        3: 0   0
        4: 0   0
        5: 1   0
       ---
     9996: 1   0
     9997: 0   0
     9998: 0   0
     9999: 0   0
    10000: 1   0
I want to generate a sequence (1, 2, 3,..., n) that resets whenever R changes from the previous row. Think of it like I'm counting a streak of random numbers.
So the above would then look like:
           R Seq
        1: 1   1
        2: 1   2
        3: 0   1
        4: 0   2
        5: 1   1
       ---
     9996: 1   5
     9997: 0   1
     9998: 0   2
     9999: 0   3
    10000: 1   1
Thoughts?
Solution
Here is an option:
    set.seed(1)
    DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
    # group rows by a run counter, then number the rows within each run
    DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
    DT
We create a counter that increments every time your 0-1 variable changes using `cumsum(abs(diff(R)))`. The `c(0, ...)` part is to ensure we get the correct length vector, since `diff(R)` returns one element fewer than `R`. Then we split by it with `by`. This produces:

           R Seq
        1: 0   1
        2: 0   2
        3: 1   1
        4: 1   2
        5: 0   1
       ---
     9996: 1   1
     9997: 0   1
     9998: 1   1
     9999: 1   2
    10000: 1   3
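To make the grouping key easier to follow, here is a small sketch on a hand-written vector rather than the random data. The names `R_toy`, `run_id`, and `seq_toy` are made up for illustration, and base R's `ave()` stands in for the `by` step so the snippet runs without data.table; it is not the code from the answer, just the same idea in miniature.

```r
# Toy 0-1 vector with four runs: (1, 1), (0, 0, 0), (1), (0, 0)
R_toy <- c(1, 1, 0, 0, 0, 1, 0, 0)

# diff() is one element shorter than its input, so pad with a leading 0.
# abs() turns both 0 -> 1 and 1 -> 0 changes into a +1 step.
run_id <- cumsum(c(0, abs(diff(R_toy))))

# Number the rows within each run (the base R analogue of seq(.N) with by).
seq_toy <- ave(R_toy, run_id, FUN = seq_along)

print(data.frame(R = R_toy, run_id = run_id, Seq = seq_toy))
```

Each change in `R_toy` bumps `run_id` by one, so rows in the same streak share a key and the per-group sequence restarts at 1.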
EDIT: Addressing request for clarification:
Let's look at the computation I'm using in `by`, broken down into two new columns:

    DT[, diff:=c(0, diff(R))]
    DT[, cumsum:=cumsum(abs(diff))]
    print(DT, topn=10)
Produces:
           R Seq diff cumsum
        1: 0   1    0      0
        2: 0   2    0      0
        3: 1   1    1      1
        4: 1   2    0      1
        5: 0   1   -1      2
        6: 1   1    1      3
        7: 1   2    0      3
        8: 1   3    0      3
        9: 1   4    0      3
       10: 0   1   -1      4
       ---
     9991: 1   2    0   5021
     9992: 1   3    0   5021
     9993: 1   4    0   5021
     9994: 1   5    0   5021
     9995: 0   1   -1   5022
     9996: 1   1    1   5023
     9997: 0   1   -1   5024
     9998: 1   1    1   5025
     9999: 1   2    0   5025
    10000: 1   3    0   5025
You can see how the cumulative sum of the absolute value of the diff increments by one each time R changes. We can then use that `cumsum` column to break up the `data.table` into chunks, and for each chunk, generate a sequence using `seq(.N)` that counts to the number of items in the chunk (`.N` represents exactly that: how many items are in each `by` group).
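As a possible alternative that is not part of the answer above, newer data.table releases provide `rleid()`, which builds the same kind of run counter directly and also works for non-numeric columns where `diff()` would not. The sketch below assumes a data.table version that includes `rleid()` and uses a made-up table `toy`; base R's `rle()` and `sequence()` serve as a cross-check.

```r
library(data.table)

# Hypothetical toy table with runs of length 2, 2, 1, 3
toy <- data.table(R = c(0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L))

# rleid(R) gives one id per run of identical consecutive values,
# so it can replace cumsum(c(0, abs(diff(R)))) as the grouping key.
toy[, Seq := seq_len(.N), by = rleid(R)]

# Base R cross-check: run lengths expanded into 1..n counters.
base_seq <- sequence(rle(toy$R)$lengths)

print(toy)
```

`seq_len(.N)` is used here instead of `seq(.N)` only as a defensive habit; within a `by` group `.N` is always at least 1, so both give the same result.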