Save storage space for small integers or factors with few levels


Question

R seems to require four bytes of storage per integer, even for small ones:

> object.size(rep(1L, 10000))
40040 bytes

And the same holds even for factors:

> object.size(factor(rep(1L, 10000)))
40456 bytes

I think this could be handled much better, especially in the latter case. Is there a solution that would help me reduce the storage requirements for this case to eight or even two bits per row? Perhaps one that uses the raw type internally for storage but otherwise behaves like a normal factor. The bit package offers this for bits, but I haven't found anything similar for factors.

My data frame with just a few million rows is consuming gigabytes, which is a huge waste of memory and run time (!). Compression will reduce the required disk space, but again at the expense of run time.

Related:

  • Why do logicals (booleans) in R require 4 bytes?
  • How can I efficiently construct a very long factor with few levels?

Recommended answer

Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the prerequisite conversion operations if memory is your bottleneck and CPU time isn't. For example:

f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes

f.raw = as.raw(f)
object.size(f.raw)
#100040 bytes

# to go back (this works here because the levels happen to be the integer codes themselves):
identical(as.factor(as.integer(f.raw)), f)
#[1] TRUE
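
The conversion above is only lossless under the stated assumption about the number of levels, and a raw vector has no way to represent NA. A minimal sketch of a guard around the conversion follows; the name factor_to_raw is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): convert a factor to
# raw codes only when the conversion cannot lose information.
factor_to_raw <- function(f) {
  stopifnot(is.factor(f))
  # Factor codes run from 1 to nlevels(f); a raw byte holds 0..255.
  if (nlevels(f) > 255L) stop("too many levels for one byte per element")
  # as.raw() would turn NA into 00 (with a warning), so refuse NAs up front.
  if (anyNA(f)) stop("NA cannot be represented in a raw vector")
  as.raw(as.integer(f))
}

codes <- factor_to_raw(factor(c("lo", "hi", "lo"), levels = c("lo", "hi")))
codes
# [1] 01 02 01
```

If NAs must be kept, one option is to reserve a code for them (for example by calling addNA on the factor first), at the cost of one level.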

You can also save the factor levels separately and recover them later if you need them. For grouping and similar operations, though, you can work with the raw vector throughout and never go back to factors (except for presentation).
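
As a rough illustration of keeping the levels apart from the codes, the sketch below counts group sizes on the raw codes and then restores the factor. The variable names (lev, codes, counts, restored) are made up for this example, and note that as.integer(codes) temporarily allocates a full integer vector, so this trades some transient memory for convenience.

```r
f <- factor(c("b", "a", "b", "c", "b"))

lev   <- levels(f)               # labels, stored once
codes <- as.raw(as.integer(f))   # one byte per element

# Group sizes computed from the codes alone
counts <- tabulate(as.integer(codes), nbins = length(lev))
names(counts) <- lev
counts
# a b c
# 1 3 1

# Comparisons work on raw directly, without converting the whole vector
n_b <- sum(codes == as.raw(2))   # number of elements in level "b"

# Restore the factor for presentation
restored <- factor(lev[as.integer(codes)], levels = lev)
identical(restored, f)
# [1] TRUE
```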

If you have specific use cases where this method gives you trouble, please post them; otherwise I think this should work just fine.

Here is a starting point for your byte.factor class:

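# Store the integer codes of a factor as raw (one byte each) and keep the levels
byte.factor = function(f) {
  res = as.raw(f)                      # codes 1..nlevels(f); needs fewer than 256 levels
  attr(res, "levels") <- levels(f)
  attr(res, "class") <- "byte.factor"
  res
}

# Rebuild an ordinary factor; call it by its full name, since base as.factor is not generic
as.factor.byte.factor = function(b) {
  factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
}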

So you can do things like:

f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"

as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

If you want to make as.factor generic, check out how data.table overrides rbind.data.frame, then add whatever functions you need. It should all be quite straightforward.
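
One possible way to get there, sketched under assumptions: instead of data.table's mechanism, this simply masks as.factor with an S3 generic in the current environment and falls back to base::as.factor for everything else. The `[.byte.factor` method is likewise an illustrative addition, not something from the original answer.

```r
# Constructor as in the answer, repeated so the sketch is self-contained
byte.factor <- function(f) {
  res <- as.raw(as.integer(f))
  attr(res, "levels") <- levels(f)
  class(res) <- "byte.factor"
  res
}

# Mask as.factor with a generic (assumption: masking is acceptable here)
as.factor <- function(x, ...) UseMethod("as.factor")
as.factor.default <- function(x, ...) base::as.factor(x)
as.factor.byte.factor <- function(x, ...) {
  lev <- attr(x, "levels")
  factor(lev[as.integer(unclass(x))], levels = lev)
}

# Subsetting that keeps the levels and the class
`[.byte.factor` <- function(x, i) {
  res <- unclass(x)[i]
  attr(res, "levels") <- attr(x, "levels")
  class(res) <- "byte.factor"
  res
}

f <- factor(c("a", "b", "a"), levels = letters)
b <- byte.factor(f)
identical(as.factor(b), f)
# [1] TRUE
as.factor(b[2:3])   # factor with values b, a and the full set of levels
```

Masking a base function affects only code that looks as.factor up through this environment; functions inside packages keep calling the base version.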
