R data.table: Count Occurrences Prior to Current Measurement
Problem description
I've a set of measurements that are taken over a period of days. The number of measurements is typically 4. The range of numbers that can be captured in any measurement is 1-5 (in real life, given the test set, the range could be as high as 100 or as low as 20).
I want to count, per day, how many of each value has happened prior to the current day.
Let me explain with some sample data:
    # test data creation
    d1 = list(as.Date("2013-5-4"), 4, 2)
    d2 = list(as.Date("2013-5-9"), 2, 5)
    d3 = list(as.Date("2013-5-16"), 3, 2)
    d4 = list(as.Date("2013-5-30"), 1, 4)
    d = rbind(d1, d2, d3, d4)
    colnames(d) <- c("Date", "V1", "V2")
    tt = as.data.table(d)
I want to run a function that will add 5 columns (one per possible value in the range). In each of the columns I want the COUNT of the occurrences of that value prior to the test date.
For example, the output of the function for 2013-5-30 would be C1=0, C2=3, C3=1, C4=1, C5=1.
It's counting how many times:
1 appeared before and not including 5/30, which is zero
2 appeared before and not including 5/30, which is three
3 appeared before and not including 5/30, which is one
etc.
Additionally, it should also include a column for what percentage of the total measurements that number appears. For instance, on 5/30 there were 6 measurements before 5/30, so pc1 = 0/6, pc2 = 3/6, pc3 = 1/6, pc4 = 1/6, pc5 = 1/6.
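The arithmetic for 5/30 can be checked by hand with a few lines of base R. This is only a sketch of the expected numbers, not the requested data.table solution; the data frame `d` and the names `cutoff`, `prior`, `counts` and `shares` are illustrative and do not come from the original post.

```r
# Sample measurements from the question, held in a plain data.frame (base R only).
d <- data.frame(
  Date = as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)

cutoff <- as.Date("2013-05-30")
# All values measured strictly before the cutoff date.
prior <- unlist(d[d$Date < cutoff, c("V1", "V2")])

counts <- tabulate(prior, nbins = 5)  # occurrences of the values 1..5
shares <- counts / length(prior)      # fraction of the 6 prior measurements

names(counts) <- paste0("C", 1:5)
names(shares) <- paste0("PC", 1:5)
print(counts)
print(round(shares, 4))
```

The counts come out as 0, 3, 1, 1, 1 and the shares as those counts divided by 6, matching the figures quoted for 5/30.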
I would like to use the data.table assignment notation ( := ) to add these multiple columns all in one shot. The output that I'm looking for is of the format:
Date V1 V2 C1 PC1 C2 PC2 C3 PC3 C4 PC4 C5 PC5
Solution
1. data.table
First replace the strange construct for t in the question with a more usual one:

    library(data.table)
    t <- data.table(
      Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
      V1 = c(4, 2, 3, 1),
      V2 = c(2, 5, 2, 4)
    )
Now tabulate each row and use cumsum to accumulate prior rows. perm is a permutation vector used to rearrange the column numbers of the C columns (nc + 1:n) and the PC columns (nc + n + 1:n) so that they alternate.

    nc <- ncol(t)                   # 3
    n <- t[, max(V1, V2)]           # 5
    Cnames <- paste0("C", 1:n)
    PCnames <- paste0("PC", 1:n)
    perm <- c(1:nc, rbind(nc + 1:n, nc + n + 1:n))
    # 1) count each value within the row, 2) turn the counts into totals over
    # earlier rows only, 3) divide by the number of earlier measurements
    # (0, 2, 4, ...), 4) reorder the columns
    t[, (Cnames) := as.list(tabulate(c(V1, V2), n)), by = 1:nrow(t)][,
      (Cnames) := lapply(.SD, function(x) cumsum(x) - x), .SDcols = Cnames][,
      (PCnames) := lapply(.SD, function(x) x / seq(0, length.out = .N, by = nc - 1)), .SDcols = Cnames][,
      perm, with = FALSE]
The last line gives:
             Date V1 V2 C1 PC1 C2 PC2 C3       PC3 C4       PC4 C5       PC5
    1: 2013-05-04  4  2  0 NaN  0 NaN  0       NaN  0       NaN  0       NaN
    2: 2013-05-09  2  5  0   0  1 0.5  0 0.0000000  1 0.5000000  0 0.0000000
    3: 2013-05-16  3  2  0   0  2 0.5  0 0.0000000  1 0.2500000  1 0.2500000
    4: 2013-05-30  1  4  0   0  3 0.5  1 0.1666667  1 0.1666667  1 0.1666667
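To see why `cumsum(x) - x` yields counts over earlier rows only, the same computation can be sketched in base R on a plain matrix. This is a cross-check under assumptions, not part of the original answer: the matrix `V` and the names `per_row`, `C`, `denom` and `PC` are made up for illustration.

```r
# Base R only; V is a hypothetical matrix holding the question's V1/V2 columns.
V <- cbind(V1 = c(4, 2, 3, 1), V2 = c(2, 5, 2, 4))
n <- max(V)  # 5 possible values

# One row of per-value counts for each measurement date.
per_row <- t(apply(V, 1, tabulate, nbins = n))

# Running total including the current row, minus the current row,
# so that only strictly earlier rows are counted.
C <- apply(per_row, 2, cumsum) - per_row

# Number of measurements taken before each row: 0, 2, 4, 6.
denom <- (seq_len(nrow(V)) - 1) * ncol(V)
PC <- C / denom  # first row is 0/0 = NaN, as in the output above

colnames(C) <- paste0("C", 1:n)
colnames(PC) <- paste0("PC", 1:n)
print(cbind(V, C))
print(round(PC, 4))
```

Dividing the matrix by `denom` relies on R recycling the vector down the columns, so each row is divided by its own count of earlier measurements.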
1a. data.table alternative
If it's OK to omit the row of the first date (which is not very useful since there are no dates prior to the first date), then we can perform the following tedious but straightforward self join:

    t <- data.table(
      Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
      V1 = c(4, 2, 3, 1),
      V2 = c(2, 5, 2, 4)
    )
    tt <- t[, one := 1]   # constant key so that every row joins to every row
    setkey(tt, one)
    # The duplicated columns from the self join appear here as Date.1, V1.1, V2.1;
    # newer data.table releases may name them i.Date, i.V1, i.V2 instead, so adjust as needed.
    tt[tt, , allow.cartesian = TRUE][Date > Date.1, list(
      C1 = sum(.SD == 1), PC1 = mean(.SD == 1),
      C2 = sum(.SD == 2), PC2 = mean(.SD == 2),
      C3 = sum(.SD == 3), PC3 = mean(.SD == 3),
      C4 = sum(.SD == 4), PC4 = mean(.SD == 4),
      C5 = sum(.SD == 5), PC5 = mean(.SD == 5)
    ), by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]
1b. data.table alternative
Or we can rewrite 1a more compactly as this (where tt, n, Cnames and PCnames are from above):

    tt[tt, , allow.cartesian = TRUE][Date > Date.1, setNames(as.list(rbind(
      sapply(1:n, function(i, .SD) sum(.SD == i), .SD = .SD),
      sapply(1:n, function(i, .SD) mean(.SD == i), .SD = .SD)
    )), c(rbind(Cnames, PCnames))),
    by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]
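The `c(rbind(...))` idiom used for the names in 1b, and for perm in solution 1, interleaves two vectors. A small base R illustration follows; `n` is set to 3 here only to keep the printout short, and `interleaved` is an illustrative name.

```r
n <- 3  # small value for illustration only
Cnames <- paste0("C", 1:n)
PCnames <- paste0("PC", 1:n)

# rbind() stacks the two vectors as rows; c() then reads the matrix column by
# column, which alternates between the first and second row.
interleaved <- c(rbind(Cnames, PCnames))
print(interleaved)
# "C1" "PC1" "C2" "PC2" "C3" "PC3"

# The same idea applied to column numbers, as in the perm vector of solution 1.
nc <- 3
perm <- c(1:nc, rbind(nc + 1:n, nc + n + 1:n))
print(perm)
# 1 2 3 4 7 5 8 6 9
```

Because `as.list(rbind(...))` flattens its matrix in the same column-major order, the values and the names line up pairwise as C1, PC1, C2, PC2, and so on.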
2. sqldf
An alternative to data.table would be to use SQL with this similarly tedious but straightforward self join:

    library(sqldf)
    sqldf("select a.Date, a.V1, a.V2,
      sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) C1,
      sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) /
        cast (2 * count(*) - 2 as real) PC1,
      sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) C2,
      sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) /
        cast (2 * count(*) - 2 as real) PC2,
      sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) C3,
      sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) /
        cast (2 * count(*) - 2 as real) PC3,
      sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) C4,
      sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) /
        cast (2 * count(*) - 2 as real) PC4,
      sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) C5,
      sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) /
        cast (2 * count(*) - 2 as real) PC5
      from t a, t b where a.Date >= b.Date
      group by a.Date")
2a. sqldf alternative
An alternative would be to use string manipulation to create the above sql string like this:
    # fn$ comes from the gsubfn package, which sqldf loads; it substitutes $i, $s
    # and backquoted expressions into the string
    f <- function(i) {
      s <- fn$identity("sum(((b.V1 = $i) + (b.V2 = $i)) * (a.Date > b.Date))")
      fn$identity("$s C$i,\n $s /\ncast (2 * count(*) - 2 as real) PC$i")
    }
    s <- fn$identity("select a.Date, a.V1, a.V2, `toString(sapply(1:5, f))`
      from t a, t b where a.Date >= b.Date
      group by a.Date")
    sqldf(s)
2b. second sqldf alternative
The SQL solution can be simplified substantially if we are willing to do without an output row for the first date. This may make sense as the first date has no prior dates to tabulate. Note that avg is taken over the joined rows of b, each of which holds two measurements, so it is divided by 2 to give the share of measurements:

    sqldf("select a.Date, a.V1, a.V2,
      sum((b.V1 = 1) + (b.V2 = 1)) C1,
      avg((b.V1 = 1) + (b.V2 = 1)) / 2 PC1,
      sum((b.V1 = 2) + (b.V2 = 2)) C2,
      avg((b.V1 = 2) + (b.V2 = 2)) / 2 PC2,
      sum((b.V1 = 3) + (b.V2 = 3)) C3,
      avg((b.V1 = 3) + (b.V2 = 3)) / 2 PC3,
      sum((b.V1 = 4) + (b.V2 = 4)) C4,
      avg((b.V1 = 4) + (b.V2 = 4)) / 2 PC4,
      sum((b.V1 = 5) + (b.V2 = 5)) C5,
      avg((b.V1 = 5) + (b.V2 = 5)) / 2 PC5
      from t a, t b where a.Date > b.Date
      group by a.Date")

Again it would be possible to create the SQL string to avoid repetition in the same manner as shown in the prior solution.
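One possible way to generate the 2b query without the repetition is sketched below using base R's `sprintf` and `paste` rather than `fn$identity`. The helper name `term` is hypothetical, and the `/ 2` on the avg term is an assumption made here so that the share is per measurement rather than per joined row (each row of b carries two measurements).

```r
# Hypothetical helper: builds the C<i> and PC<i> select terms for value i.
term <- function(i) {
  hit <- sprintf("(b.V1 = %d) + (b.V2 = %d)", i, i)
  # avg() averages over joined rows, each with two measurements, hence "/ 2".
  sprintf("sum(%s) C%d, avg(%s) / 2 PC%d", hit, i, hit, i)
}

sql <- paste(
  "select a.Date, a.V1, a.V2,",
  paste(vapply(1:5, term, character(1)), collapse = ", "),
  "from t a, t b where a.Date > b.Date group by a.Date"
)
cat(sql, "\n")
# With library(sqldf) loaded and t defined as above, sqldf(sql) would run the query.
```

Changing `1:5` to `1:n` would adapt the query to a different range of values.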
UPDATE: added PC columns and some simplifications
UPDATE 2: added additional solutions