Object size for characters in R - How does R global string pool work?

Question

I am reading Hadley's Advanced R Programming and when it discusses the memory size for characters it says this:

R has a global string pool. This means that each unique string is only stored in one place, and therefore character vectors take up less memory than you might expect.

The example the book gives is this:

library(pryr)
object_size("banana")
#> 96 B
object_size(rep("banana", 10))
#> 216 B

One of the exercises in this section is to compare these two character vectors:

vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

object_size(vec)
#> 13.4 kB

object_size(str)
#> 8.74 kB

Now, since the passage states that R has a global string pool, and since vector vec is composed mainly of repetitions of two strings ("ba" and "na") I actually would - intuitively - expect the size of vec to be smaller than the size of str.

So my question is: how could you most accurately estimate the size of those vectors beforehand?

Recommended answer

The key difference comes from the pointers in vec: every element of a character vector (STRSXP) is a pointer to a scalar string (CHARSXP). vec holds some 1326 such pointers in total, whereas str holds only 51, and a pointer is probably 8 bytes on your platform. The global pool (also known as the CHARSXP cache) deduplicates only the scalar strings themselves, so "ba" and "na" are each stored once, but every reference to them still costs a pointer. Another non-obvious factor is internal fragmentation: on my system, a scalar string occupies the same amount of memory whether it has zero or 7 characters, grows only once it reaches 8 characters, and so on in steps. See the repeated sizes in the following:

unlist(sapply(str, object.size))

 [1]  96  96  96 104 104 104 104 120 120 120 120 120 120 120 120 136 136 136 136
[20] 136 136 136 136 152 152 152 152 152 152 152 152 216 216 216 216 216 216 216
[39] 216 216 216 216 216 216 216 216 216 216 216 216 216
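The pointer counts quoted in the answer can be checked directly from the two objects. The following is a minimal sketch in base R; the variable names (n_ptr_vec, n_ptr_str and so on) are made up for illustration, and the byte estimate covers only the pointers, not vector headers or the strings themselves.

```r
# Rebuild the two objects from the question.
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

# Each element of a character vector is a pointer to a pooled CHARSXP.
n_ptr_vec <- sum(lengths(vec))   # 1 + 2 + ... + 51 = 1326
n_ptr_str <- sum(lengths(str))   # one string per list element = 51

# Distinct strings that the pool has to hold for each object.
n_unique_vec <- length(unique(unlist(vec)))  # only "ba" and "na"
n_unique_str <- length(unique(unlist(str)))  # 51 different strings

# Pointer size on this platform (8 bytes on typical 64-bit builds).
ptr_bytes <- .Machine$sizeof.pointer

cat("pointers in vec:", n_ptr_vec, " unique strings:", n_unique_vec, "\n")
cat("pointers in str:", n_ptr_str, " unique strings:", n_unique_str, "\n")
cat("extra pointer bytes in vec:", (n_ptr_vec - n_ptr_str) * ptr_bytes, "\n")
```

So vec saves on string storage (two pooled strings instead of 51) but pays for 1275 additional pointers, which is why it ends up larger than str here.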

These are, however, implementation details of R's memory manager that could change, and user programs should not depend on them in any way: with another object layout or memory manager, str could use more space than vec.
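Because the size steps are implementation details, it is probably safer to probe them on your own system than to assume the numbers shown above. The sketch below uses utils::object.size; the helper size_for and the range 0:40 are arbitrary choices for illustration, and the byte values it prints will depend on your R version and platform.

```r
# Reported size of a length-one character vector holding n characters.
size_for <- function(n) {
  s <- paste(rep("a", n), collapse = "")
  as.numeric(utils::object.size(s))
}

n_chars <- 0:40
sizes <- vapply(n_chars, size_for, numeric(1))

# Keep the first row plus every string length at which the size changes.
steps <- data.frame(n_chars, bytes = sizes)[c(TRUE, diff(sizes) != 0), ]
print(steps)
```

Each row of steps marks the start of a new size class; string lengths between two consecutive rows all report the same number of bytes.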
