Why does median trip up data.table (integer versus double)?


Problem description


I have a data.table called enc.per.day for encounters per day. It has 2403 rows in which a date of service is specified and the number of patients seen on that day. I wanted to see the median number of patients seen on any type of weekday.

enc.per.day[,list(patient.encounters=median(n)),by=list(weekdays(DOS))]

That line gives an error

Error in `[.data.table`(enc.per.day, , list(patient.encounters = median(n)),  : 
  columns of j don't evaluate to consistent types for each group: result for group 4 has column 1 type 'integer' but expecting type 'double'

The following all work well

tapply(enc.per.day$n,weekdays(enc.per.day$DOS),median)
enc.per.day[,list(patient.encounters=round(median(n))),by=list(weekdays(DOS))]
enc.per.day[,list(patient.encounters=median(n)+0),by=list(weekdays(DOS))]

What is going on? It took me a long time to figure out why my code would not work.

By the way, the underlying vector enc.per.day$n is an integer vector:

storage.mode(enc.per.day$n)

returns "integer". Further, there are no NAs anywhere in the data.table.

Solution

TL;DR: wrap median() in as.double().

median() 'trips up' data.table because, even when it is passed only integer vectors, it sometimes returns an integer value and sometimes returns a double.

## median of 1:3 is 2, of type "integer" 
typeof(median(1:3))
# [1] "integer"

## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"
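The reason for the two types can be sketched in a few lines. The helper below, median_sketch(), is a hypothetical name written only for illustration and is not the actual source of median(): for an odd-length vector it picks the middle element, which keeps the input's integer type, while for an even-length vector it averages the two middle elements with mean(), which returns a double.

```r
# Hypothetical helper for illustration only; not the real implementation of median().
median_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- (n + 1L) %/% 2L
  if (n %% 2L == 1L) {
    # Odd length: return the middle element, so an integer input stays integer.
    x[half]
  } else {
    # Even length: average the two middle elements; mean() returns a double.
    mean(x[half + 0:1])
  }
}

typeof(median_sketch(1:3))       # odd length
typeof(median_sketch(1:2))       # even length
typeof(median_sketch(c(1L, 3L))) # even length, whole-number result
```

Note that the last call yields a double even though the median is a whole number, so the returned type depends on the group's length rather than on its values.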

Reproducing your error message with a minimal example:

library(data.table)
dt <- data.table(patients = c(1:3, 1:2), 
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

dt[,median(patients), by=weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) : 
#   columns of j don't evaluate to consistent types for each group: 
#   result for group 2 has column 1 type 'double' but expecting type 'integer'

data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away, in group 2, it gets passed a value of type "double", which won't fit in its "integer" results vector. (In your case the mismatch shows up at group 4 with the types the other way round: the earlier groups returned doubles, and group 4 returned an integer.)


data.table could instead accumulate results until the end of the group-wise calculations, and then perform type conversions if necessary, but that would require a bunch of additional performance-degrading overhead; instead, it just reports what happened and lets you fix the problem. After the first group has run, and it knows the type of the result, it allocates a result vector of that type as long as the number of groups, and then populates it. If it later finds that some groups return more than 1 item, it will grow (i.e., reallocate) that result vector as needed. In most cases though, data.table's first guess for the final size of the result is right first time (e.g., 1 row result per group) and hence fast.
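The strategy described above (take the type from the first group, preallocate, then fill) can be imitated in plain R. This is only a rough sketch under stated assumptions: group_apply_strict() is a hypothetical function invented for this illustration, it assumes one result value per group, and it is not how data.table is implemented internally.

```r
# Hypothetical illustration only; this is not data.table's internal code.
group_apply_strict <- function(values, groups, fun) {
  # Split into groups, keeping them in order of first appearance.
  pieces <- split(values, factor(groups, levels = unique(groups)))
  first <- fun(pieces[[1L]])
  # Preallocate the result vector using the type returned for the first group.
  out <- vector(mode = typeof(first), length = length(pieces))
  out[1L] <- first
  for (i in seq_along(pieces)[-1L]) {
    res <- fun(pieces[[i]])
    # Refuse any later group whose result type differs from the first group's.
    if (typeof(res) != typeof(out)) {
      stop(sprintf("result for group %d has type '%s' but expecting type '%s'",
                   i, typeof(res), typeof(out)))
    }
    out[i] <- res
  }
  names(out) <- names(pieces)
  out
}

patients <- c(1:3, 1:2)
day <- c("Mon", "Mon", "Mon", "Tue", "Tue")

# Group 1 yields an integer and group 2 a double, so this call stops with an error.
strict_err <- tryCatch(group_apply_strict(patients, day, median),
                       error = function(e) conditionMessage(e))

# Every group yields a double, so this call succeeds.
strict_ok <- group_apply_strict(patients, day, function(x) as.double(median(x)))
```

Checking the type once per group is cheap, whereas collecting everything first and converting afterwards would mean holding and revisiting all per-group results.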

In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
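Applied to the minimal example, the fix might look like the following. The column name patient.encounters is borrowed from the question; the rest is a sketch that assumes the data.table package is installed.

```r
library(data.table)

dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

# Coerce each group's median to double so that every group returns the same type.
res <- dt[, list(patient.encounters = as.double(median(patients))), by = weekdays]
res
```

The same pattern should carry over to the original call by replacing median(n) with as.double(median(n)).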

(By the way, your version using round() worked because it always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)
