R data.table: Count Occurrences Prior to Current Measurement


Problem description


I have a set of measurements taken over a period of days. The number of measurements is typically 4. The range of numbers that can be captured in any measurement is 1-5 (in real life, given the test set, the range could be as high as 100 or as low as 20).

I want to count, per day, how many times each value has occurred prior to the current day.

Let me explain with some sample data:

# test data creation
d1 = list(as.Date("2013-5-4"),  4,2)
d2 = list(as.Date("2013-5-9"),  2,5)
d3 = list(as.Date("2013-5-16"), 3,2)
d4 = list(as.Date("2013-5-30"), 1,4)

d = rbind(d1,d2,d3,d4)
colnames(d) <- c("Date", "V1", "V2")

tt = as.data.table(d)

I want to run a function that will add 5 columns (one per possible value in the range). In each of the columns I want the COUNT of the occurrences of that value prior to the test date.

For example, the output of the function for 2013-5-30 would be C1=0, C2=3, C3=1, C4=1, C5=1.

It's counting how many times:

1 appeared before and not including 5/30, which is zero
2 appeared before and not including 5/30, which is three
3 appeared before and not including 5/30, which is one
etc.

Additionally, it should include a column showing what percentage of the total prior measurements each number accounts for. For instance, on 5/30 there were 6 measurements before 5/30, so

pc1=(0/6), pc2=3/6, pc3=1/6, pc4=1/6, pc5= 1/6
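
As a sanity check on the numbers above, here is a minimal base R sketch (not a data.table solution) that tabulates only the measurements taken strictly before 5/30. The vector names are illustrative and do not come from the question's code.

```r
# Sample measurements from the question, as plain vectors
dates <- as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30"))
v1 <- c(4, 2, 3, 1)
v2 <- c(2, 5, 2, 4)

# Rows measured strictly before the date of interest
is_prior <- dates < as.Date("2013-05-30")
prior_values <- c(v1[is_prior], v2[is_prior])

# Count how often each value 1..5 occurs among the prior measurements
counts <- tabulate(prior_values, nbins = 5)

# Share of the prior measurements taken up by each value
shares <- counts / length(prior_values)

print(counts)  # 0 3 1 1 1
print(shares)
```

The counts match C1..C5 in the example, and each share is the count divided by the 6 prior measurements.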

I would like to use the data.table assignment notation ( := ) to add these multiple columns all in one shot. The output that I'm looking for is of the format:

Date V1 V2 C1 PC1 C2 PC2 C3 PC3 C4 PC4 C5 PC5

Solution

1. data.table

First replace the strange construct for the table in the question with a more usual one (called t here):

library(data.table)
t <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)

Now tabulate each row and use cumsum to accumulate prior rows. perm is a permutation vector used to rearrange the column numbers of the C columns (nc + 1:n) and the PC columns (nc + n + 1:n).

nc <- ncol(t) # 3
n <- t[, max(V1, V2)] # 5

Cnames <- paste0("C", 1:n)
PCnames <- paste0("PC", 1:n)

perm <- c(1:nc, rbind(nc + 1:n, nc + n + 1:n))

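# Step 1: per-row counts of each value 1..n
# Step 2: turn them into counts over earlier rows only (cumsum(x) - x)
# Step 3: divide by the number of prior measurements (0, 2, 4, ...)
# Step 4: reorder the columns so that each C column is followed by its PC column
t[, (Cnames) := as.list(tabulate(c(V1, V2), n)), by = 1:nrow(t)][,
 (Cnames) := lapply(.SD, function(x) cumsum(x) - x), .SDcols = Cnames][,
 (PCnames) := lapply(.SD, function(x) x / seq(0, length.out = .N, by = nc - 1)), .SDcols = Cnames][,
 perm, with = FALSE]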

The last line gives:

         Date V1 V2 C1 PC1 C2 PC2 C3       PC3 C4       PC4 C5       PC5
1: 2013-05-04  4  2  0 NaN  0 NaN  0       NaN  0       NaN  0       NaN
2: 2013-05-09  2  5  0   0  1 0.5  0 0.0000000  1 0.5000000  0 0.0000000
3: 2013-05-16  3  2  0   0  2 0.5  0 0.0000000  1 0.2500000  1 0.2500000
4: 2013-05-30  1  4  0   0  3 0.5  1 0.1666667  1 0.1666667  1 0.1666667
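
The two building blocks of the chain, tabulate and cumsum(x) - x, can be tried on their own. The following is a base R sketch on the same sample values; the names day_counts, prior_counts and prior_total are made up for illustration.

```r
# tabulate() turns one day's measurements into per-value counts (values 1..5)
day_counts <- rbind(
  tabulate(c(4, 2), nbins = 5),
  tabulate(c(2, 5), nbins = 5),
  tabulate(c(3, 2), nbins = 5),
  tabulate(c(1, 4), nbins = 5)
)

# cumsum(x) - x is a running total that excludes the current row,
# i.e. the count over all earlier rows only
prior_counts <- apply(day_counts, 2, function(x) cumsum(x) - x)

# Number of measurements before each row: 0, 2, 4, 6
prior_total <- seq(0, by = 2, length.out = nrow(day_counts))

print(prior_counts[4, ])                   # 0 3 1 1 1
print(prior_counts[4, ] / prior_total[4])  # shares for the last date
```

Dividing the first row by prior_total[1], which is 0, gives 0/0 and explains the NaN values in the first row of the output.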

1a. data.table alternative

If it's OK to omit the row of the first date (which is not very useful since there are no dates prior to the first date) then we can perform the following tedious but straightforward self join:

t <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)
tt <- t[, one := 1]
setkey(tt, one)
tt[tt,,allow.cartesian=TRUE][Date > Date.1, list(
    C1 = sum(.SD == 1), PC1 = mean(.SD == 1), 
    C2 = sum(.SD == 2), PC2 = mean(.SD == 2), 
    C3 = sum(.SD == 3), PC3 = mean(.SD == 3), 
    C4 = sum(.SD == 4), PC4 = mean(.SD == 4), 
    C5 = sum(.SD == 5), PC5 = mean(.SD == 5)
), by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]
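
The logic of the self join can also be spelled out in base R, which avoids depending on the auto-generated names of the duplicated columns (Date.1, V1.1, V2.1 above; these may be named differently in other data.table versions). This is only a sketch of the same idea, with illustrative names such as pairs and prior_rows, not a replacement for the data.table code.

```r
df <- data.frame(
  Date = as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)

# All (current row, candidate prior row) index pairs: the Cartesian self join
pairs <- expand.grid(cur = seq_len(nrow(df)), prior = seq_len(nrow(df)))

# Keep only pairs where the candidate row is strictly earlier
pairs <- pairs[df$Date[pairs$cur] > df$Date[pairs$prior], ]

# For each current row, tabulate the values found in its prior rows
prior_rows <- split(pairs$prior, pairs$cur)
counts <- t(sapply(prior_rows, function(p) {
  tabulate(c(df$V1[p], df$V2[p]), nbins = 5)
}))
colnames(counts) <- paste0("C", 1:5)

# The first date has no prior rows, so it drops out, as in the join above
result <- cbind(df[as.integer(rownames(counts)), ], counts)
print(result)
```

Dividing each count by twice the number of prior rows would give the corresponding PC columns.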

1b. data.table alternative

Or we can rewrite 1a more compactly like this (where tt, n, Cnames and PCnames are from above):

tt[tt,,allow.cartesian=TRUE][Date > Date.1, setNames(as.list(rbind(
   sapply(1:n, function(i, .SD) sum(.SD==i), .SD=.SD),
   sapply(1:n, function(i, .SD) mean(.SD==i), .SD=.SD)
  )), c(rbind(Cnames, PCnames))),
  by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]
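
The compact version relies on c(rbind(a, b)) to interleave two vectors of equal length. A small base R sketch of that trick, using made-up counts for three values rather than the real data:

```r
n <- 3
Cnames <- paste0("C", 1:n)
PCnames <- paste0("PC", 1:n)

# rbind() stacks the two vectors as rows; c() then reads the matrix
# column by column, which alternates between the two
interleaved <- c(rbind(Cnames, PCnames))
print(interleaved)  # "C1" "PC1" "C2" "PC2" "C3" "PC3"

# The same trick pairs each count with its share, in matching order
counts <- c(0, 3, 1)
shares <- counts / 6
named_row <- setNames(as.list(rbind(counts, shares)), interleaved)
str(named_row)
```

Because the values and the names are interleaved the same way, each count ends up next to its own percentage.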

2. sqldf

An alternative to data.table would be to use SQL with this similarly tedious but straightforward self-join:

library(sqldf)
sqldf("select a.Date, a.V1, a.V2, 
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) C1,
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC1,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) C2,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC2,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) C3,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC3,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) C4,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC4,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) C5,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC5
from t a, t b where a.Date >= b.Date
group by a.Date")
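
The denominator 2 * count(*) - 2 deserves a word: because the join condition is a.Date >= b.Date, each group contains the current row as well as the prior rows, so count(*) is one more than the number of prior rows, and each row carries two measurements. A base R sketch of that arithmetic (the variable names are illustrative):

```r
dates <- as.Date(c("2013-05-04", "2013-05-09", "2013-05-16", "2013-05-30"))

# count(*) per a.Date under "a.Date >= b.Date":
# all rows up to and including the current one
rows_joined <- sapply(seq_along(dates), function(i) sum(dates <= dates[i]))

# Two measurements per row, minus the two that belong to the current row
denominator <- 2 * rows_joined - 2

print(rows_joined)  # 1 2 3 4
print(denominator)  # 0 2 4 6
```

These are the same prior totals (0, 2, 4, 6) that solution 1 produces with seq, and the zero for the first date is why that row has no meaningful percentage.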

2a. sqldf alternative

An alternative would be to use string manipulation to create the above sql string like this:

f <- function(i) {
    s <- fn$identity("sum(((b.V1 = $i) + (b.V2 = $i)) * (a.Date > b.Date))")
    fn$identity("$s C$i,\n $s /\ncast (2 * count(*) - 2 as real) PC$i")
}
s <- fn$identity("select a.Date, a.V1, a.V2, `toString(sapply(1:5, f))`
    from t a, t b where a.Date >= b.Date
    group by a.Date")

sqldf(s)
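
If the gsubfn-style fn$ interpolation is unfamiliar, the same string can be assembled with base R's sprintf and paste. This is a sketch only; make_terms and sql are names chosen here, and the result is meant to be passed to sqldf in the same way as s above.

```r
# Build the C<i> and PC<i> select terms for one value i
make_terms <- function(i) {
  s <- sprintf("sum(((b.V1 = %d) + (b.V2 = %d)) * (a.Date > b.Date))", i, i)
  sprintf("%s C%d,\n%s /\ncast (2 * count(*) - 2 as real) PC%d", s, i, s, i)
}

# Glue the fixed parts of the query around the generated terms
sql <- paste0(
  "select a.Date, a.V1, a.V2,\n",
  paste(sapply(1:5, make_terms), collapse = ",\n"),
  "\nfrom t a, t b where a.Date >= b.Date\ngroup by a.Date"
)

cat(sql)
```

Changing 1:5 to another range is then the only edit needed when the set of possible values grows.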

2b. second sqldf alternative

The sql solution can be simplified substantially if we are willing to do without an output row for the first date. This may make sense as the first date has no prior dates to tabulate. Note that avg averages over the prior rows, and each row holds two measurements, so it is divided by 2 to give the share of measurements:

sqldf("select a.Date, a.V1, a.V2, 
sum((b.V1 = 1) + (b.V2 = 1)) C1,
avg((b.V1 = 1) + (b.V2 = 1)) / 2 PC1,
sum((b.V1 = 2) + (b.V2 = 2)) C2,
avg((b.V1 = 2) + (b.V2 = 2)) / 2 PC2,
sum((b.V1 = 3) + (b.V2 = 3)) C3,
avg((b.V1 = 3) + (b.V2 = 3)) / 2 PC3,
sum((b.V1 = 4) + (b.V2 = 4)) C4,
avg((b.V1 = 4) + (b.V2 = 4)) / 2 PC4,
sum((b.V1 = 5) + (b.V2 = 5)) C5,
avg((b.V1 = 5) + (b.V2 = 5)) / 2 PC5
from t a, t b where a.Date > b.Date
group by a.Date")

Again it would be possible to create the sql string to avoid repetition in the same manner as shown in the prior solution.
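
One way to do that for this simplified query is sketched below with base R's sprintf, using positional references (%1$d) so the value only has to be supplied once. The names terms and sql_2b are illustrative. The sketch divides each avg by 2, an adjustment made here so that the PC columns are shares of measurements (two per row) rather than per-row averages.

```r
values <- 1:5

# One "sum(...) C<i>, avg(...) / 2 PC<i>" pair per value;
# %1$d reuses the single argument everywhere it appears
terms <- sprintf(
  "sum((b.V1 = %1$d) + (b.V2 = %1$d)) C%1$d,\navg((b.V1 = %1$d) + (b.V2 = %1$d)) / 2 PC%1$d",
  values
)

sql_2b <- paste0(
  "select a.Date, a.V1, a.V2,\n",
  paste(terms, collapse = ",\n"),
  "\nfrom t a, t b where a.Date > b.Date\ngroup by a.Date"
)

cat(sql_2b)
```

The resulting string can be passed to sqldf like any other query.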

UPDATE: added PC columns and some simplifications

UPDATE 2: added additional solutions
