How to convert a factor to integer/numeric without loss of information?
Question
When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.
f <- factor(sample(runif(5), 20, replace = TRUE))
## [1] 0.0248644019011408 0.0248644019011408 0.179684827337041
## [4] 0.0284090070053935 0.363644931698218 0.363644931698218
## [7] 0.179684827337041 0.249704354675487 0.249704354675487
## [10] 0.0248644019011408 0.249704354675487 0.0284090070053935
## [13] 0.179684827337041 0.0248644019011408 0.179684827337041
## [16] 0.363644931698218 0.249704354675487 0.363644931698218
## [19] 0.179684827337041 0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218
as.numeric(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
as.integer(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
I have to resort to paste to get the real values:
as.numeric(paste(f))
## [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
## [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901
Is there a better way to convert a factor to numeric?
Recommended answer
See the Warning section of ?factor:

"In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f))."
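As a minimal sketch of the recommended idiom, the snippet below uses a small made-up vector (the values in x are hypothetical, chosen only for illustration) to contrast the level codes with the recovered values. Indexing with [f] works because a factor used as an index is treated as its integer codes.

```r
# Hypothetical sample data, not taken from the question.
x <- c(10.5, 3.2, 10.5, 7, 3.2)
f <- factor(x)

# The levels are stored as character strings, one per distinct value.
levels(f)

# as.integer() / as.numeric() return the codes (positions in levels(f)),
# not the original numbers.
codes <- as.integer(f)

# Recommended idiom: convert the levels once, then index by the codes.
recovered <- as.numeric(levels(f))[f]

# Compare with a tolerance, since the documentation only promises
# "approximately" the original values.
isTRUE(all.equal(recovered, x))
```

The comparison uses all.equal rather than identical because the levels are text representations of the numbers, so an exact round trip is not guaranteed for every value.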
The R FAQ has similar advice.
Why is as.numeric(levels(f))[f] more efficient than as.numeric(as.character(f))?
as.numeric(as.character(f)) is effectively as.numeric(levels(f)[f]), so you are performing the conversion to numeric on length(f) values rather than on nlevels(f) values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
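The equivalence described above can be checked directly. This is a small sketch on made-up data (the three values and the seed are arbitrary assumptions): all three expressions give the same result, and they differ only in how many strings are passed to as.numeric.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Hypothetical data: 1000 elements drawn from three distinct values.
f <- factor(sample(c(0.5, 1.25, 2.75), 1000, replace = TRUE))

# Converts nlevels(f) strings to numbers, then indexes by the codes.
by_level <- as.numeric(levels(f))[f]

# Indexes first, so length(f) strings are converted to numbers.
by_element <- as.numeric(levels(f)[f])

# as.character() on a factor expands the levels per element as well.
via_char <- as.numeric(as.character(f))

identical(by_level, by_element)
identical(by_element, via_char)

# Number of strings parsed by the first form versus the other two.
c(nlevels = nlevels(f), length = length(f))
```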
Some timings
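library(microbenchmark)

# Assumption: x is never defined in the post; it is presumably the factor f.
x <- f

microbenchmark(
  as.numeric(levels(f))[f],     # convert the levels, then index by the codes
  as.numeric(levels(f)[f]),     # index the levels, then convert every element
  as.numeric(as.character(f)),
  paste0(x),                    # string conversion only, no as.numeric()
  paste(x),
  times = 1e5
)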
## Unit: microseconds
##                         expr   min    lq      mean median     uq      max neval
##     as.numeric(levels(f))[f] 3.982 5.120  6.088624  5.405  5.974 1981.418 1e+05
##     as.numeric(levels(f)[f]) 5.973 7.111  8.352032  7.396  8.250 4256.380 1e+05
##  as.numeric(as.character(f)) 6.827 8.249  9.628264  8.534  9.671 1983.694 1e+05
##                    paste0(x) 7.964 9.387 11.026351  9.956 10.810 2911.257 1e+05
##                     paste(x) 7.965 9.387 11.127308  9.956 11.093 2419.458 1e+05
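The timings above use a 20-element factor, where the gap is small. As a rough sketch of the "long vector, few levels" case, the snippet below times both approaches with base R's system.time on made-up data (one million elements and five levels are arbitrary assumptions). Actual numbers depend on the machine and R version, so none are shown here.

```r
# Hypothetical data: one million elements, only five distinct values.
vals <- c(0.1, 0.2, 0.3, 0.4, 0.5)
big  <- factor(sample(vals, 1e6, replace = TRUE))

# Convert the five levels once, then index by the codes.
t_levels <- system.time(a <- as.numeric(levels(big))[big])

# Expand to one string per element, then convert every string.
t_char <- system.time(b <- as.numeric(as.character(big)))

# Both approaches return the same vector.
identical(a, b)

# Elapsed wall-clock seconds for each approach.
c(levels_first = t_levels[["elapsed"]], as_character = t_char[["elapsed"]])
```

A single system.time call is noisy, so for a careful comparison microbenchmark, as used above, remains the better tool.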