Why do logicals (booleans) in R require 4 bytes?

Question

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that R also facilitates storage of NA values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]

Update 1. I think that the architecture reasons offered make some sense, but that feels a little ex post facto. I haven't checked 32-bit or 16-bit R to see how large their logicals are - that could lend some support to the idea. It seems, from the R Internals manual, that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in R: as Brian Diggs notes, the bit package facilitates bit vectors, which was very useful for the problem in the question above (an 8X-10X speedup for the task was obtained by converting from 4-byte logical values to bit vectors). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la bit()). Why not do that and save 30 of the 32 bits (keeping 1 for the value and 1 for NA) for a long vector?

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using bit, but that's because a 97% savings in space matters in some cases. :)
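As a rough check on the figures above, the savings can be worked out directly. This is only a back-of-the-envelope sketch in Python: it ignores vector headers and allocation overhead, and the function name `savings` is made up for illustration.

```python
# Rough storage arithmetic per logical entry (vector headers ignored).
BITS_PER_R_LOGICAL = 32  # 4 bytes per element, as stated in the question


def savings(bits_per_entry, baseline_bits=BITS_PER_R_LOGICAL):
    """Fraction of the baseline storage saved by a denser encoding."""
    return 1 - bits_per_entry / baseline_bits


one_bit = savings(1)   # value bit only, no NA support (a plain bit vector)
two_bit = savings(2)   # one value bit plus one NA bit
one_byte = savings(8)  # a 1-byte logical, as described for Matlab

print(f"1 bit/entry:  {one_bit:.3%}")   # 96.875%
print(f"2 bits/entry: {two_bit:.3%}")   # 93.750%
print(f"1 byte/entry: {one_byte:.3%}")  # 75.000%
```

So the "97%" figure corresponds to one bit per entry (31 of 32 bits saved); a two-bit encoding that keeps NA support would save 30 of 32 bits, or 93.75%.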

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would LGLSXP be 4 bytes on all platforms? (Is CHARSXP typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Update: the idea of using CHARSXP is likely bogus, because operations on CHARSXP aren't as useful as those for integers, such as sum. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)

Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding why it appears this way in R, which seems to arise from 2 premises:

  1. In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. int8). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as sum(myVector), mean(myVector), etc.]

  2. Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.
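The code-reuse premise can be illustrated with a small sketch. It is written in Python as a stand-in, not taken from R's sources: the names `int_sum` and `encode_logical` are hypothetical, and the sentinel value assumes (to my understanding of R's C headers) that logical NA shares the integer NA marker, INT_MIN.

```python
NA_INTEGER = -2**31  # assumption: NA sentinel shared by integers and logicals


def int_sum(codes, na_rm=False):
    """NA-aware integer sum; returns None to stand in for NA."""
    total = 0
    for x in codes:
        if x == NA_INTEGER:
            if na_rm:
                continue   # skip missing values
            return None    # a single NA makes the whole sum NA
        total += x
    return total


def encode_logical(values):
    """Encode True/False/None as 1/0/NA using the integer representation."""
    return [NA_INTEGER if v is None else int(v) for v in values]


flags = encode_logical([True, False, True, None, True])
print(int_sum(flags))              # None (NA propagates)
print(int_sum(flags, na_rm=True))  # 3 (the number of TRUE values)
```

Because the logicals are stored with the integer layout, the integer routine needs no changes to count TRUE values, which is the convenience the first premise describes.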

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.
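A minimal sketch of the "work at the word level" advice, applied to a 2x2 cross tabulation of two boolean vectors. It is written in Python, where a single arbitrary-precision int stands in for an array of machine words; the helper names `pack`, `popcount`, and `crosstab` are invented for this illustration and are not the bit package's API.

```python
def pack(bools):
    """Pack booleans into one int used as a bitmap (bit i holds element i)."""
    word = 0
    for i, b in enumerate(bools):
        if b:
            word |= 1 << i
    return word


def popcount(word):
    """Number of set bits in a non-negative int."""
    return bin(word).count("1")


def crosstab(a, b):
    """2x2 table of two equal-length boolean vectors via AND plus popcount."""
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have equal length")
    mask = (1 << n) - 1                  # keeps only the n valid bits
    wa, wb = pack(a), pack(b)
    not_a, not_b = ~wa & mask, ~wb & mask
    return {
        (True, True): popcount(wa & wb),
        (True, False): popcount(wa & not_b),
        (False, True): popcount(not_a & wb),
        (False, False): popcount(not_a & not_b),
    }


a = [True, True, False, False, True]
b = [True, False, True, False, True]
table = crosstab(a, b)
print(table[(True, True)])  # 2
```

Once the vectors are packed, each cell of the table costs one bitwise AND and one population count over whole words, rather than a loop that inspects elements one at a time.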

Thanks to all for a very thorough set of answers and insights.

Recommended answer

Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

Basically, a logical vector is identical to an integer vector, so sum and other algorithms for integers work pretty much unchanged on logical vectors.

In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...

@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...
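The two-bit-vector layout described above could look roughly like the following. This is a hypothetical Python sketch, not anything that exists in R: the class name `BitLogical` is made up, `None` plays the role of both NA (as an element) and NULL (as the absent NA bitmap).

```python
class BitLogical:
    """Logical vector stored as a value bitmap plus an optional NA bitmap."""

    def __init__(self, values):
        self.n = len(values)
        self.bits = bytearray((self.n + 7) // 8)  # 1 bit per TRUE/FALSE value
        self.na = None  # NA bitmap, allocated only when the first NA appears
        for i, v in enumerate(values):
            if v is None:
                if self.na is None:
                    self.na = bytearray(len(self.bits))
                self.na[i >> 3] |= 1 << (i & 7)
            elif v:
                self.bits[i >> 3] |= 1 << (i & 7)

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        byte, mask = i >> 3, 1 << (i & 7)
        if self.na is not None and self.na[byte] & mask:
            return None  # NA takes precedence over the value bit
        return bool(self.bits[byte] & mask)

    def nbytes(self):
        """Payload size in bytes (object overhead ignored)."""
        return len(self.bits) + (len(self.na) if self.na is not None else 0)


v = BitLogical([True, False, None, True])
w = BitLogical([True] * 16)
print([v[i] for i in range(4)])  # [True, False, None, True]
print(w.na is None, w.nbytes())  # True 2
```

With no NAs present, 16 values occupy 2 bytes of payload here, against 64 bytes at 4 bytes per value; the NA bitmap doubles the payload only for vectors that actually contain missing values.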

Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...
