Fastest way to cross-tabulate two massive logical vectors in R


Question


For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?

I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon.

Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big; I chose a total size just under 2.5GB, i.e. 2.4GB). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time).

set.seed(0)
N = 3E8
p = 0.02
x = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
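As an aside on the sizing: R stores each logical value in 4 bytes, so the pair of vectors occupies 2 * 3e8 * 4 bytes = 2.4e9 bytes, which is the 2.4GB quoted above. A quick check:

print(object.size(x), units = "auto")   # about 1.2e9 bytes per vector, so ~2.4 GB for the pair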

Some obvious methods:

  1. table
  2. bigtabulate
  3. Simple logical operations (e.g. sum(x & y))
  4. Vector multiplication (boo)
  5. data.table
  6. Some of the above, with parallel from the multicore package (or the new parallel package)

I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.

I find that table works very slowly. bigtabulate seems like overkill for a pair of logical vectors. Finally, doing the vanilla logical operations seems like a kludge, and it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster.
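For concreteness, since the answer with those timings is not reproduced on this page, the first two approaches might look roughly like this (a sketch only; the exact bigtabulate call is an assumption, modeled on the 1 * cbind(...) conversion mentioned in Update 1 below):

system.time(t_table <- table(x, y))                 # approach 1: table

require(bigtabulate)
m <- 1L * cbind(x, y)                               # logical -> integer matrix
system.time(t_big <- bigtabulate(m, ccols = 1:2))   # approach 2: bigtabulate

The third option, plain logical operations, is sketched under Update 1 below.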

Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage out of storing it as such, and then using vector multiplication.
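A sketch of that idea, using sparse 0/1 vectors from the Matrix package (illustrative names, not the OP's code): only the TRUE/TRUE cell needs the multiplication, and the remaining cells follow from the marginal counts. Note that the conversion to numeric is itself a pass over the data, which is the type-conversion cost mentioned above.

library(Matrix)
xs <- Matrix(as.numeric(x), ncol = 1, sparse = TRUE)   # sparse 0/1 column vector
ys <- Matrix(as.numeric(y), ncol = 1, sparse = TRUE)

n11 <- crossprod(xs, ys)[1, 1]         # sum(x * y): the TRUE/TRUE cell
n10 <- sum(x) - n11                    # TRUE/FALSE, from the marginal of x
n01 <- sum(y) - n11                    # FALSE/TRUE, from the marginal of y
n00 <- length(x) - n11 - n10 - n01     # FALSE/FALSE, from the total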

Feel free to vary N and p, if that will demonstrate any interesting behavior of the tabulation functions. :)


Update 1. My first answer gives timings on three naive methods, which is the basis for believing table is slow. Yet, the key thing to realize is that the "logical" method is grossly inefficient. Look at what it's doing (a code sketch follows below):

  • 4 logical vector operations
  • 4 type conversions (logical to integer or FP - for sum)
  • 4 vector summations
  • 8 assignments (one for each logical operation and one for each summation)

Not only that, but it's not even compiled or parallelized. Yet, it still beats the pants off of table. Notice that bigtabulate, with an extra type conversion (1 * cbind...), still beats table.
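For reference, here is a plausible sketch of that "logical" method (the OP's exact function is not reproduced on this page); func_logical is also the name used in the benchmark in the solution below:

# A plausible reconstruction of the naive logical cross-tabulation (not the OP's exact code).
# Each line performs logical vector operations, an implicit logical-to-integer conversion
# inside sum(), a vector summation, and assignments - which is where the counts above come from.
func_logical <- function(x, y) {
    tt <- sum( x &  y)    # TRUE/TRUE
    tf <- sum( x & !y)    # TRUE/FALSE
    ft <- sum(!x &  y)    # FALSE/TRUE
    ff <- sum(!x & !y)    # FALSE/FALSE
    c(tt = tt, tf = tf, ft = ft, ff = ff)
}

This function also runs unchanged on bit vectors, which is what the answer below exploits.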

Update 2. Lest anyone point out that logical vectors in R support NA, and that that will be a wrench in the system for these cross tabulations (which is true in most cases), I should point out that my vectors come from is.na() or is.finite(). :) I've been debugging NA and other non-finite values - they've been a headache for me recently. If you don't know whether or not any of your entries are NA, you could test with any(is.na(yourVector)) - this would be wise before you adopt some of the ideas arising in this Q&A.


Update 3. Brandon Bertelsen asked a very reasonable question in the comments: why use so much data when a sub-sample (the initial set, after all, is a sample ;-)) might be adequate for the purposes of creating a cross-tabulation? Not to drift too far into statistics, but the data arises from cases where the TRUE observations are very rare, for both variables. One is a result of a data anomaly, the other due to a possible bug in code (possible because we only see the computational result - think of variable x as "Garbage In" and y as "Garbage Out"). As a result, the question is whether the issues in the output caused by the code are solely those cases where the data is anomalous, or whether there are other instances where good data goes bad. (This is why I asked a question about stopping when a NaN, NA, or Inf is encountered: https://stackoverflow.com/questions/9167376/how-to-force-an-error-if-non-finite-values-na-nan-or-inf-are-encountered)

That also explains why my example has a low probability for TRUE values; these really occur much less than 0.1% of the time.

Does this suggest a different solution path? Yes: it suggests that we may use two indices (i.e. the locations of TRUE in each set) and count set intersections. I avoided set intersections because I was burned a while back by Matlab (yes, this is R, but bear with me), which would first sort the elements of a set before it does an intersection. (I vaguely recall the complexity was even more embarrassing: like O(n^2) instead of O(n log n).)
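A sketch of that index-based idea (illustrative, not the OP's code); note that base R's intersect() is built on match(), which hashes rather than sorts, so the Matlab-style sorting penalty does not apply here:

ix <- which(x)                       # positions of the (rare) TRUE values in x
iy <- which(y)                       # positions of the (rare) TRUE values in y

n11 <- length(intersect(ix, iy))     # TRUE/TRUE cell via set intersection
# The other three cells again follow from the marginals: length(ix), length(iy), length(x).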

Solution

If you're doing a lot of operations on huge logical vectors, take a look at the bit package. It saves a ton of memory by storing the booleans as true 1-bit booleans.

This doesn't help with table; it actually makes it worse because there are more unique values in the bit vector due to how it's constructed. But it really helps with logical comparisons.

# N <- 3e7  (note: smaller than the N = 3e8 used in the question)
require(bit)         # bit-packed logical vectors
require(rbenchmark)  # provides benchmark()

xb <- as.bit(x)      # pack the plain logical vectors into 1-bit-per-element storage
yb <- as.bit(y)

# func_logical() is the naive logical cross-tabulation sketched in the question;
# it runs unchanged on bit vectors because bit supports &, ! and sum().
benchmark(replications = 1, order = "elapsed",
    bit     = {res <- func_logical(xb, yb)},
    logical = {res <- func_logical(x, y)}
)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 1     bit            1   0.129  1.00000     0.132    0.000          0         0
# 2 logical            1   3.677 28.50388     2.684    0.928          0         0
