R - overcome "cut" ignoring values outside of range in data table


Problem description


I am comparing two years' worth of daily soil moisture (SM) measurements. In one year, SM ranged from 0 to 0.6. In the other year, which had more rain, SM ranged from 0 to 0.8. The data also contain some NAs, where the SM sensor did not work for some reason. Let's re-create something similar:

library(data.table)
set.seed(24)
dt1 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"), 
                  sm=sample(c(NA, runif(10, min=0, max=0.6)), 365, replace = TRUE))

dt2 <- data.table(date=seq(as.Date("2015-01-01"), length.out=365, by="1 day"), 
                  sm=sample(c(NA, runif(10, min=0, max=0.8)), 365, replace = TRUE))

I am trying to compare both datasets based on the proportion of values between classes of SM in each month. The classes I am interested in are seq(0, 0.8, by=0.2). I also need to count the proportion of failed measurements (NA) per month.

I managed to do that by using akrun's helpful answer here: R - Calculate percentage of occurrences in data.table by month

tmp1 <- dt1[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
                                    by=.(month=month(date),
                                         grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                                                 labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]

tmp2 <- dt2[, n := .N, month(date)][, .(perc=100 * .N/n[1]),
                                    by=.(month=month(date),
                                         grp=cut(sm, breaks=seq(0, 0.8, by=0.2),
                                                 labels = c('0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8')))]

However, the output is not exactly what I expect. Since values in dt1 range only from 0 to 0.6, there is no 0.6-0.8 category at all in the resulting data table tmp1.

It looks like cut ignores the last category (0.6-0.8) because there is no SM measurement within that range. This makes my comparison really inconvenient, because I don't have the same groups in the resulting data tables tmp1 and tmp2.

Does anybody know how to fix this, i.e. how to "force" cut to consider values outside the break range? I need all SM categories in both tmp1 and tmp2, even if their count is 0.

Just as a reference, this issue does not happen if we use table, which always shows all categories even if their count is zero:

t1 <- runif(10, 0, 0.6)
t2 <- runif(10, 0, 0.8)

table(cut(t1, breaks=seq(0, 0.8, by=0.2)))

  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
        5         3         2         0 
table(cut(t2, breaks=seq(0, 0.8, by=0.2)))

  (0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
        1         3         2         4 
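The difference between the two results seems to come from how the counts are produced rather than from cut itself. The short sketch below uses made-up values (the vector `x` is an assumption for illustration, not part of the data above) to show that cut returns a factor that keeps every interval as a level. table counts per level, whereas grouping only sees the values that actually occur.

```r
# Illustrative values only: none of them falls in (0.6, 0.8]
x <- c(0.05, 0.15, 0.35, 0.55)
g <- cut(x, breaks = seq(0, 0.8, by = 0.2))

levels(g)   # all four intervals are kept as factor levels
table(g)    # one count per level, including a zero for the empty interval
unique(g)   # only the intervals that actually occur in x
```

Grouping with by= in data.table behaves like the last line: it creates one group per observed value, so an interval with no observations never becomes a group.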

Any input appreciated.

Solution

Use CJ to count all levels, even those that don't appear in the table:

f = function(d){

    # create month column
    d[, month := month(date)]

    # roll to make cut-group column
    mdt = data.table(sm = c(NA, seq(0, .8, by=.2)))
    d[, lb := mdt[.SD, on=.(sm), roll=TRUE, x.sm]]

    # join with CJ to ensure all levels are present
    res = d[CJ(month = month, lb = mdt$sm, unique = TRUE), on=.(month, lb), .N, by=.EACHI]

    # rescale to monthly pct
    res[, pct := N/sum(N), by=month][]

}

# try it
f(dt1)
f(dt2)

You could also do this with cut. The important thing is how you're tabulating results, not how you're grouping them...
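A minimal sketch of that cut-based variant follows. It is one possible reading of the remark above, not code from the original answer: the helper name `pct_by_class` and the "failed" label for missing measurements are assumptions made for illustration. The groups are still built with cut, but the counting is done by joining against every month/class combination from CJ, so empty classes come back with a count of zero.

```r
library(data.table)

set.seed(24)
dt1 <- data.table(date = seq(as.Date("2015-01-01"), length.out = 365, by = "1 day"),
                  sm = sample(c(NA, runif(10, min = 0, max = 0.6)), 365, replace = TRUE))
dt2 <- data.table(date = seq(as.Date("2015-01-01"), length.out = 365, by = "1 day"),
                  sm = sample(c(NA, runif(10, min = 0, max = 0.8)), 365, replace = TRUE))

# Hypothetical helper: percentage of days per month in each SM class
pct_by_class <- function(d) {
  labs <- c("0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8")

  # classify with cut; NA (sensor failure) gets its own explicit label
  grp <- as.character(cut(d$sm, breaks = seq(0, 0.8, by = 0.2), labels = labs))
  grp[is.na(grp)] <- "failed"

  # build a fresh table so the input is not modified by reference
  x <- data.table(month = month(d$date), grp = grp)

  # every month/class combination, whether or not it occurs in the data
  all_groups <- CJ(month = 1:12, grp = c(labs, "failed"))

  # count the matching rows for each combination; unmatched ones get N = 0
  res <- x[all_groups, on = .(month, grp), .N, by = .EACHI]

  # rescale to a monthly percentage
  res[, perc := 100 * N / sum(N), by = month][]
}

tmp1 <- pct_by_class(dt1)
tmp2 <- pct_by_class(dt2)
```

With this, tmp1 and tmp2 contain the same 60 month/class rows, and the 0.6-0.8 class appears in tmp1 with a count of zero. Note that cut returns NA for any value outside the breaks (including an exact 0 with the default right-closed intervals), so such values would be counted as "failed" here.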
