Why do logicals (booleans) in R require 4 bytes?

Question

For a vector of logical values, why does R allocate 4 bytes, when a bit vector would consume 1 bit per entry? (See this question for examples.)

Now, I realize that R also facilitates storage of NA values, but couldn't that be done with an additional bit vector? In other words, why isn't it enough to just use a cheap two bit data structure?

For what it's worth, Matlab uses 1 byte for logicals, though it doesn't facilitate NA values. I'm not sure why MathWorks isn't satisfied with one bit functionality, much less a two bit data structure, but they have fancy pants marketers... [I'm gonna milk "two bit" for all it's worth in this question. ;-)]

Update 1. I think that the architecture reasons offered make some sense, but they feel a little ex post facto. I haven't checked 32 bit or 16 bit R to see how large their logicals are - that could lend some support to the idea. It seems, from the R Internals manual, that logical vectors (LGLSXP) and integers (INTSXP) are 32 bits on every platform. I can understand a universal size for integers, independent of word size. Similarly, storage of logicals also seems to be independent of word size. But it's so BIG. :)

In addition, if the word size argument is so powerful, it seems strange to me to see Matlab (I think it's a 32 bit Matlab) consume only 1 byte - I wonder if MathWorks chose to be more memory efficient with a tradeoff for programming complexity and some other overhead for finding sub-word objects.

Also, there are certainly other options in R: as Brian Diggs notes, the bit package facilitates bit vectors, which was very useful for the problem in the question above (converting from 4 byte logical values to bit vectors gave an 8X-10X speedup for the task). Although speed of accessing memory is important, moving 30-31 extra uninformative bits (from an information theory perspective) is wasteful. For instance, one could use something like the memory tricks used for integers described here - grab a bunch of extra memory (V cells) and then process things at the bit level (a la bit()). Why not do that for a long vector and save 30 bits per entry, keeping only 2 (1 for the value, 1 for NA)?

To the extent that my RAM and computational speed are affected by booleans, I intend to switch over to using bit, but that's because a 97% savings in space matters in some cases. :)
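
As a rough check on that 97% figure, here is a minimal Python sketch of the arithmetic. The helper names (`storage_bytes`, `saving`) are made up for illustration, and the numbers ignore per-vector header overhead, so this is an approximation rather than a measurement of R or of the bit package.

```python
import math

def storage_bytes(n, bits_per_entry):
    """Bytes needed for n entries at the given width (vector headers ignored)."""
    return math.ceil(n * bits_per_entry / 8)

n = 1_000_000
as_int32 = storage_bytes(n, 32)  # R logical: 4 bytes per entry
as_byte = storage_bytes(n, 8)    # 1 byte per entry, as reported for Matlab
two_bit = storage_bytes(n, 2)    # 1 value bit + 1 NA bit
one_bit = storage_bytes(n, 1)    # value bit only, no NA support

def saving(compact, baseline=as_int32):
    """Fraction of the baseline storage that the compact layout avoids."""
    return 1 - compact / baseline

for label, size in [("1 byte", as_byte), ("2 bits", two_bit), ("1 bit", one_bit)]:
    print(f"{label}: {size} bytes, {saving(size):.3%} saved vs 4 bytes")
```

So the 97% is the one-bit case (31 of 32 bits saved); keeping a second bit for NA still saves 30 of 32.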

I think that the answer to this question will come from someone with a deeper understanding of R's design or internals. The best example is that Matlab uses a different size for their logical, and memory word sizes wouldn't be the answer in that case. Python may be similar to R, for what it's worth.

A related way to phrase this might be: why would LGLSXP be 4 bytes on all platforms? (Is CHARSXP typically smaller, and wouldn't that work as well? Why not go even smaller, and just over-allocate?) (Update: the idea of using CHARSXP is likely bogus, because operations on CHARSXP aren't as fully useful as those for integers, such as sum. Using the same data structure as characters might save space, but would constrain which existing methods could operate on it. A more appropriate consideration is the use of smaller integers, as discussed below.)

Update 2. There have been some very good and enlightening answers here, especially relative to how one should implement retrieval and processing of booleans for the goals of speed and programming efficiency. I think that Tommy's answer is particularly plausible regarding why it appears this way in R, which seems to arise from two premises:


  1. In order to support addition on a logical vector (note that "logical" is defined by programming language / environment, and is not the same as a boolean), one is best served by reusing code for adding integers. In the case of R, integers consume 4 bytes. In the case of Matlab, the smallest integer is 1 byte (i.e. int8). This would explain why something different would be a nuisance to write for logicals. [To those not familiar with R, it supports many numerical operations on logicals, such as sum(myVector), mean(myVector), etc.]

  2. Legacy support makes it exceedingly difficult to do something other than what has been done in R and S-Plus for a long time now. Moreover, I suspect that in the early days of S, S-Plus, and R, if someone was doing a lot of boolean operations, they did them in C, rather than trying to do so much work with logicals in R.
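
The first premise can be illustrated with a small Python sketch (not R's actual C code). It assumes logicals are stored exactly like 32-bit integers, with NA encoded as the most negative 32-bit value, which as far as I know is how R encodes both integer and logical NA. The names `encode_logicals` and `int_sum` are hypothetical, and `None` stands in for NA.

```python
import struct

NA_INT = -2**31  # assumed NA sentinel: the most negative 32-bit integer

def encode_logicals(values):
    """Pack True/False/None as little-endian 32-bit ints (None stands in for NA)."""
    ints = [NA_INT if v is None else int(v) for v in values]
    return struct.pack(f"<{len(ints)}i", *ints)

def int_sum(buf, na_rm=False):
    """Sum a buffer of 32-bit ints. This routine knows nothing about logicals.

    Returns None (standing in for NA) if an NA is present and na_rm is False.
    """
    total = 0
    for (x,) in struct.iter_unpack("<i", buf):
        if x == NA_INT:
            if na_rm:
                continue
            return None
        total += x
    return total

buf = encode_logicals([True, False, True, None])
print(len(buf))                  # 4 entries at 4 bytes each
print(int_sum(buf, na_rm=True))  # counts the TRUE entries
```

Because the integer routine works on the logical buffer unchanged, nothing extra has to be written for sum-like operations; the cost is 4 bytes per entry.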

The other answers are fantastic for the purposes of how one might implement better boolean handling - don't naively assume that one can get at any individual bit: it's most efficient to load a word, then mask the bits that are not of interest, as Dervall has described. This is very, very useful advice should one write specialized code for boolean manipulation for R (e.g. my question on cross tabulations): don't iterate over bits, but instead work at the word level.
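
To make the word-level advice concrete, here is a minimal Python sketch of one cell of a cross tabulation (how often two boolean vectors are both TRUE). It assumes the vectors are packed into 64-bit words held in a plain list; the function names are hypothetical and this is not how the bit package is implemented.

```python
WORD_BITS = 64

def pack_bits(flags):
    """Pack booleans into 64-bit words; bit i of word w holds flags[w * 64 + i]."""
    words = [0] * ((len(flags) + WORD_BITS - 1) // WORD_BITS)
    for i, flag in enumerate(flags):
        if flag:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words

def popcount(word):
    """Number of set bits in one word."""
    return bin(word).count("1")

def count_both_true(a_words, b_words):
    """Count positions where both vectors are TRUE, one AND per word."""
    return sum(popcount(a & b) for a, b in zip(a_words, b_words))

def get_bit(words, i):
    """Read a single entry: load its word, then mask off the other bits."""
    return ((words[i // WORD_BITS] >> (i % WORD_BITS)) & 1) == 1

a = pack_bits([i % 2 == 0 for i in range(200)])
b = pack_bits([i % 3 == 0 for i in range(200)])
print(count_both_true(a, b))  # entries that are multiples of both 2 and 3
```

The count touches 4 words instead of 200 individual entries, which is the point of working at the word level.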

Thanks to all for a very thorough set of answers and insights.

Recommended answer

Knowing a little something about R and S-Plus, I'd say that R most likely did it to be compatible with S-Plus, and S-Plus most likely did it because it was the easiest thing to do...

Basically, a logical vector is identical to an integer vector, so sum and other algorithms for integers work pretty much unchanged on logical vectors.

In 64-bit S-Plus, the integers are 64-bit and thus also the logical vectors! That's 8 bytes per logical value...

@Iterator is of course correct that a logical vector should be represented in a more compact form. Since there is already a raw vector type which is 1-byte, it would seem like a very simple change to use that one for logicals too. And 2 bits per value would of course be even better - I'd probably keep them as two separate bit vectors (TRUE/FALSE and NA/Valid), and the NA bit vector could be NULL if there are no NAs...
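
A minimal Python sketch of that two-bit-vector layout, purely for illustration: the class name `PackedLogical` is made up, Python's arbitrary-size integers stand in for the bit vectors, and `None` plays the role of both NA (as an entry) and NULL (as the missing NA mask).

```python
class PackedLogical:
    """Logical vector stored as two bit sets: the values and an optional NA mask."""

    def __init__(self, values):
        self.length = len(values)
        self.value_bits = 0  # bit i set -> entry i is TRUE
        na_bits = 0          # bit i set -> entry i is NA
        for i, v in enumerate(values):
            if v is None:
                na_bits |= 1 << i
            elif v:
                self.value_bits |= 1 << i
        # Drop the NA mask entirely when the vector contains no NAs.
        self.na_bits = na_bits if na_bits else None

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        if self.na_bits is not None and (self.na_bits >> i) & 1:
            return None
        return bool((self.value_bits >> i) & 1)

    def to_list(self):
        return [self[i] for i in range(self.length)]

v = PackedLogical([True, False, None, True])
print(v.to_list())
print(PackedLogical([True, False]).na_bits)  # no NAs, so no mask is kept
```

Checking the NA mask first means the value bit of an NA entry never needs a defined meaning, and vectors without NAs pay for only one bit per entry.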

Anyway, that's mostly a dream since there are so many RAPI packages (packages that use the R C/FORTRAN APIs) out there that would break...
