Why does median trip up data.table (integer versus double)?

Question

I have a data.table called enc.per.day for encounters per day. It has 2403 rows in which a date of service is specified and the number of patients seen on that day. I wanted to see the median number of patients seen on any type of weekday.

enc.per.day[,list(patient.encounters=median(n)),by=list(weekdays(DOS))]

That line gives the error:

Error in [.data.table(enc.per.day, , list(patient.encounters = median(n)), : columns of j don't evaluate to consistent types for each group: result for group 4 has column 1 type 'integer' but expecting type 'double'

The following all work fine:

tapply(enc.per.day$n,weekdays(enc.per.day$DOS),median)
enc.per.day[,list(patient.encounters=round(median(n))),by=list(weekdays(DOS))]
enc.per.day[,list(patient.encounters=median(n)+0),by=list(weekdays(DOS))]

What is going on? It took me a long time to figure out why my code would not work.

By the way the underlying vector enc.per.day$n is an integer

storage.mode(enc.per.day$n)

returns "integer". Further, there are no NAs anywhere in the data.table.

Recommended answer

TL;DR wrap median with as.double()

median() 'trips up' data.table because --- even when only passed integer vectors --- median() sometimes returns an integer value, and sometimes returns a double.

## median of 1:3 is 2, of type "integer" 
typeof(median(1:3))
# [1] "integer"

## median of 1:2 is 1.5, of type "double"
typeof(median(1:2))
# [1] "double"
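
The pattern behind those two results is the length of the input, at least for median's default method: an odd-length vector returns its middle element unchanged (so an integer stays an integer), while an even-length vector returns the mean of the two middle elements, which is a double even when the value happens to be a whole number. A short base-R check, using made-up vectors:

```r
odd_len  <- c(1L, 5L, 3L)   # three values: the median is the middle element
even_len <- c(1L, 3L)       # two values: the median is mean(c(1L, 3L))

median(odd_len)             # 3
typeof(median(odd_len))     # "integer"

median(even_len)            # 2, a whole number...
typeof(median(even_len))    # ...but stored as "double"
```

So a group with an odd number of rows yields an integer, and a group with an even number of rows yields a double.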

Reproducing your error message with a minimal example:

library(data.table)
dt <- data.table(patients = c(1:3, 1:2), 
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

dt[,median(patients), by=weekdays]
# Error in `[.data.table`(dt, , median(patients), by = weekdays) : 
#   columns of j don't evaluate to consistent types for each group: 
#   result for group 2 has column 1 type 'double' but expecting type 'integer'

data.table complains because, after inspecting the value of the first group to be processed, it has concluded that, OK, these results are going to be of type "integer". But then right away (or in your case in group 4), it gets passed a value of type "double", which won't fit in its "integer" results vector.

data.table could instead accumulate results until the end of the group-wise calculations, and then perform type conversions if necessary, but that would require a bunch of additional performance-degrading overhead; instead, it just reports what happened and lets you fix the problem. After the first group has run, and it knows the type of the result, it allocates a result vector of that type as long as the number of groups, and then populates it. If it later finds that some groups return more than 1 item, it will grow (i.e., reallocate) that result vector as needed. In most cases though, data.table's first guess for the final size of the result is right first time (e.g., 1 row result per group) and hence fast.
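
The allocate-once strategy described above can be imitated in a few lines of base R. This is only a toy model to make the idea concrete: the function name strict_group_apply and its error text are invented for illustration, and data.table's real grouping code is implemented in C and does considerably more.

```r
# Toy model only (hypothetical helper, not data.table's implementation).
strict_group_apply <- function(values, keys, f) {
  groups <- split(values, keys)
  first  <- f(groups[[1]])
  # Allocate the result once, using the type of the first group's value.
  result <- vector(typeof(first), length(groups))
  for (i in seq_along(groups)) {
    value <- f(groups[[i]])
    # Refuse to continue if a later group returns a different type.
    if (typeof(value) != typeof(first)) {
      stop("group ", i, " returned '", typeof(value),
           "' but expected '", typeof(first), "'")
    }
    result[i] <- value
  }
  names(result) <- names(groups)
  result
}

patients <- c(1:3, 1:2)
weekdays <- c("Mon", "Mon", "Mon", "Tue", "Tue")

# median() changes type between groups, so the strict version stops.
msg <- tryCatch(strict_group_apply(patients, weekdays, median),
                error = function(e) conditionMessage(e))
msg

# A type-stable function fills the preallocated vector without trouble.
ok <- strict_group_apply(patients, weekdays,
                         function(x) as.double(median(x)))
ok
```

The first call stops at group 2, mirroring the error in the minimal example; the second succeeds because every group returns a double.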

In this case, using as.double(median(X)) instead of median(X) provides a suitable fix.
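
Applied to the minimal example above, the fix might look like the sketch below. The column name patient.encounters is borrowed from the question, and .() is data.table's alias for list(). Whether the unwrapped call still raises the error may depend on the data.table version installed; the wrapped call returns a double for every group either way.

```r
library(data.table)

dt <- data.table(patients = c(1:3, 1:2),
                 weekdays = c("Mon", "Mon", "Mon", "Tue", "Tue"))

# as.double() makes every group's result a double, whatever the group size.
res <- dt[, .(patient.encounters = as.double(median(patients))),
          by = weekdays]
res

# The result column is a double for all groups.
typeof(res$patient.encounters)
```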

(By the way, your version using round() worked because it always returns values of type "double", as you can see by typing typeof(round(median(1:2))); typeof(round(median(1:3))).)
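
The question's other workaround, median(n) + 0, is type-stable for a related reason: the literal 0 is a double, and adding a double to an integer produces a double. A quick base-R check (an integer literal such as 0L would not have helped):

```r
typeof(median(1:3))        # "integer"
typeof(median(1:3) + 0)    # "double": 0 is a double literal
typeof(median(1:3) + 0L)   # "integer": 0L leaves the odd-length case an integer
typeof(median(1:2) + 0)    # "double"
```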
