Object size for characters in R - How does R's global string pool work?
Question
I am reading Hadley's Advanced R Programming and when it discusses the memory size for characters it says this:
R has a global string pool. This means that each unique string is only stored in one place, and therefore character vectors take up less memory than you might expect.
The example the book gives is this:
library(pryr)
object_size("banana")
#> 96 B
object_size(rep("banana", 10))
#> 216 B
One of the exercises in this section is to compare these two character vectors:
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")
object_size(vec)
#> 13.4 kB
object_size(str)
#> 8.74 kB
Now, since the passage states that R has a global string pool, and since vec is composed mainly of repetitions of two strings ("ba" and "na"), I would intuitively expect vec to be smaller than str.
So my question is: how could you most accurately estimate the size of those vectors beforehand?
Recommended answer
The key difference comes from the pointers in vec: each of the short scalar strings (CHARSXPs) has to be pointed to from the corresponding string vector (STRSXP). There are some 1326 such string pointers inside vec, but only 51 in str (a pointer is probably 8 bytes on your platform). The pool is for scalar strings (it is also known as the CHARSXP cache). Another non-obvious factor is internal fragmentation: on my system, for example, a scalar string takes the same amount of memory whether it has zero or up to 7 characters, takes more only once it reaches 8 characters, and so on. See the repeated sizes in the following:
unlist(sapply(str, object.size))
 [1]  96  96  96 104 104 104 104 120 120 120 120 120 120 120 120 136 136 136 136
[20] 136 136 136 136 152 152 152 152 152 152 152 152 216 216 216 216 216 216 216
[39] 216 216 216 216 216 216 216 216 216 216 216 216 216
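The pointer counts quoted in this answer can be checked directly. The following is a minimal sketch that rebuilds vec and str from the question; the variable names n_ptr_vec and n_ptr_str are introduced here for illustration only, and the 8-byte pointer size is an assumption that holds on typical 64-bit builds.

```r
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

# Each element of a character vector is one pointer to a CHARSXP.
n_ptr_vec <- sum(lengths(vec))  # 1 + 2 + ... + 51
n_ptr_str <- sum(lengths(str))  # one string per list element

n_ptr_vec
#> [1] 1326
n_ptr_str
#> [1] 51

# Distinct strings that actually need storage in the pool.
length(unique(unlist(vec)))
#> [1] 2
length(unique(unlist(str)))
#> [1] 51

# Extra pointer storage in vec, ASSUMING 8 bytes per pointer.
(n_ptr_vec - n_ptr_str) * 8
#> [1] 10200
```

So although vec stores only two distinct strings, it carries roughly 10 kB more pointer storage than str under that assumption. The strings in str are longer, but there are only 51 of them.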
These are, however, implementation details of R's memory manager that could change, and user programs should not depend on them in any way. With another object layout or memory manager, str could use more space than vec.
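Since the exact numbers depend on the R version and platform, one way to see the size classes on a given installation is to measure them instead of predicting them. This is only a sketch: the name sizes is hypothetical, and no expected output is shown because the values differ between systems.

```r
# Size in bytes of a length-one character vector holding 0 to 16 characters.
sizes <- vapply(
  0:16,
  function(n) as.numeric(object.size(strrep("a", n))),
  numeric(1)
)
names(sizes) <- 0:16
sizes

# Runs of identical values mark one allocation size class; a jump marks
# the string length at which the next, larger class is used.
rle(unname(sizes))
```

Runs longer than one in the rle() output correspond to the internal fragmentation described in the answer: several string lengths share the same allocated size.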