R - overcome "cut" ignoring values outside of range in data table
Problem description
I am comparing two years' worth of daily soil moisture (SM) measurements. In one year, SM ranged from 0 to 0.6. In the other year, which had more rain, SM ranged from 0 to 0.8. The data also contain some NAs, where the SM sensor did not work for some reason.
Let's re-create something similar:
library(data.table)
set.seed(24)
dt1 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"),
                  sm=sample(c(NA, runif(10, min=0, max=0.6)), 365, replace = TRUE))
dt2 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"),
                  sm=sample(c(NA, runif(10, min=0, max=0.8)), 365, replace = TRUE))
I am trying to compare both datasets based on the proportion of values falling in each SM class in each month. The classes I am interested in are defined by the breaks seq(0, 0.8, by=0.2). I also need to compute the proportion of failed measurements (NA) per month.
I managed to do that by using akrun's helpful answer here: R - Calculate percentage of occurrences in data.table by month
tmp1 <- dt1[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
            by=.(month=month(date),
                 grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                         labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]
tmp2 <- dt2[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
            by=.(month=month(date),
                 grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                         labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]
However, the output is not exactly what I expect. Since values in dt1 range only from 0 to 0.6, there is no 0.6-0.8 category at all in the resulting data table tmp1.
It looks like cut ignores the last category (0.6-0.8) because there is no SM measurement within that range. This makes my comparison really inconvenient, because I don't have the same groups in the resulting data tables tmp1 and tmp2.
Does anybody know how to fix this, i.e. how to "force" cut to keep break intervals that contain no values? I need all SM categories in both tmp1 and tmp2, even if their count is 0.
Just as a reference, this issue does not happen if we use table, which always shows all categories even if their count is zero:
t1 <- runif(10, 0, 0.6)
t2 <- runif(10, 0, 0.8)
table(cut(t1, breaks=seq(0, 0.8, by=0.2)))
  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
        5         3         2         0
table(cut(t2, breaks=seq(0, 0.8, by=0.2)))
  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
        1         3         2         4
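For what it's worth, the difference between the two behaviours can be seen on a tiny vector. The sketch below uses made-up values (x, f and by_group are illustrative names, not part of the data above): cut returns a factor that carries all four levels, table counts every level, while grouping with by only produces rows for values that actually occur.

```r
library(data.table)

# Hypothetical toy values: none of them falls in (0.6, 0.8]
x <- c(0.1, 0.3, 0.5)
f <- cut(x, breaks = seq(0, 0.8, by = 0.2))

# The factor itself still knows about all four intervals
levels(f)

# table() counts every level, including the empty one
table(f)

# Grouping only sees values that are present, so the empty level gets no row
by_group <- data.table(f = f)[, .N, by = f]
by_group
```

So cut is not dropping the last interval; the empty level simply never becomes a group.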
Any input appreciated.
Use CJ to count all levels, even those that don't appear in the table:
f = function(d){
  # create month column (note: this modifies d by reference)
  d[, month := month(date)]
  # roll to make cut-group column
  mdt = data.table(sm = c(NA, seq(0, .8, by=.2)))
  d[, lb := mdt[.SD, on=.(sm), roll=TRUE, x.sm]]
  # join with CJ to ensure all levels are present
  res = d[CJ(month = month, lb = mdt$sm, unique = TRUE), on=.(month, lb), .N, by=.EACHI]
  # rescale to monthly proportion (multiply by 100 for a percentage)
  res[, pct := N/sum(N), by=month][]
}
# try it
f(dt1)
f(dt2)
You could also do this with cut. The important thing is how you're tabulating results, not how you're grouping them...
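As a rough sketch of that cut-based variant (not the answer's own code): label each row with cut, then join against a CJ of every month and every class so that empty combinations come back with a count of 0. The names count_classes, brks and labs are made up for illustration, and storing missing measurements under the string "NA" is an assumption made here so they can be counted as their own class.

```r
library(data.table)

# Same toy data as in the question
set.seed(24)
dt1 <- data.table(
  date = seq(as.Date("2015-01-01"), length.out = 365, by = "1 day"),
  sm   = sample(c(NA, runif(10, min = 0, max = 0.6)), 365, replace = TRUE)
)

brks <- seq(0, 0.8, by = 0.2)
labs <- c("0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8")

# Hypothetical helper: monthly percentage per SM class, empty classes included
count_classes <- function(d) {
  d <- copy(d)                       # avoid modifying the input by reference
  d[, month := month(date)]
  # label each measurement; missing values get their own "NA" class (assumption)
  d[, grp := as.character(cut(sm, breaks = brks, labels = labs))]
  d[is.na(grp), grp := "NA"]
  # every month/class combination, whether observed or not
  all_groups <- CJ(month = 1:12, grp = c(labs, "NA"))
  # count rows per combination; unmatched combinations get N = 0
  res <- d[all_groups, on = .(month, grp), .N, by = .EACHI]
  # convert counts to a percentage of that month's days
  res[, perc := 100 * N / sum(N), by = month][]
}

res1 <- count_classes(dt1)
res1[grp == "0.6-0.8"]   # present for every month, with N = 0
```

Calling count_classes on dt2 as well should give two tables with identical month and grp columns, which makes them straightforward to compare or merge.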