How should I count the number of unique rows in a 'binary' matrix?

Problem description

Suppose I have a matrix whose entries are only 0 and 1, e.g.

set.seed(123)
m <- matrix( sample(0:1, 10, TRUE), nrow=5 )

with sample output:

     [,1] [,2]
[1,]    0    0
[2,]    1    1
[3,]    0    1
[4,]    1    1
[5,]    1    0

The matrix will have at most 20 columns, and will have many rows.

I want a function, let's call it rowCounts, that returns:

  1. The number of times a particular row appears in the matrix, and
  2. The index of the first occurrence of that row.

How might I solve this problem?

Solution

Building on Kevin's answer, here is a C++11 version using a slightly different approach:

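#include <Rcpp.h>
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
List rowCounts_2(IntegerMatrix x) {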
  int n = x.nrow() ;
  int nc = x.ncol() ;
  std::vector<int> hashes(n) ;
  for( int k=0, pow=1; k<nc; k++, pow*=2){
    IntegerMatrix::Column column = x.column(k) ;

    std::transform( column.begin(), column.end(), hashes.begin(), hashes.begin(), [=]( int v, int h ){
        return h + pow*v ;
    }) ;
  }

  using Pair = std::pair<int,int> ;
  std::unordered_map<int, Pair> map_counts ;

  for( int i=0; i<n; i++){
    Pair& p = map_counts[ hashes[i] ] ;
    if( p.first == 0){
      p.first = i+1 ; // using directly 1-based index
    }
    p.second++ ;
  }

  int nres = map_counts.size() ;
  IntegerVector idx(nres), counts(nres) ;
  auto it=map_counts.begin() ;
  for( int i=0; i<nres; i++, ++it){
    idx[i] = it->second.first ;
    counts[i] = it->second.second ;
  }

  return List::create( _["counts"] = counts, _["idx"] = idx );
}

The idea is to trade memory for speed. The first change is that I'm allocating and filling a std::vector<int> to host the hashes. Doing this lets me traverse the input matrix column by column, which is more efficient because R stores matrices in column-major order, so each column is contiguous in memory.
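
To make the column-wise pass concrete, here is a minimal standalone sketch without Rcpp. It assumes the matrix is passed as a flat std::vector<int> in column-major order; the helper name row_hashes and its signature are hypothetical, not part of the answer. With at most 20 columns, each 0/1 row packs into a distinct int, so the "hash" is really an exact encoding of the row.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the answer): `data` holds an n-row, nc-column
// 0/1 matrix in column-major order, the layout R uses for matrices.
std::vector<int> row_hashes(const std::vector<int>& data,
                            std::size_t n, std::size_t nc) {
    std::vector<int> hashes(n, 0);
    int pow = 1;  // weight of the current column: 1, 2, 4, ...
    for (std::size_t k = 0; k < nc; ++k, pow *= 2) {
        const int* column = data.data() + k * n;  // column k is contiguous
        for (std::size_t i = 0; i < n; ++i) {
            hashes[i] += pow * column[i];  // column k contributes bit k
        }
    }
    return hashes;
}
```

For the 5 x 2 matrix from the question, stored as {0,1,0,1,1, 0,1,1,1,0}, this gives the codes 0, 3, 2, 3, 1, so rows 2 and 4 share a code.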

Once this is done, I fill a hash map, std::unordered_map<int, std::pair<int,int>>. The key of the map is the hash, and the value is a pair (index, count), where index is the 1-based position of the row's first occurrence.
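
The counting step can be sketched on its own, again without Rcpp. The function name count_hashes is hypothetical; the logic mirrors the loop in the answer and relies on operator[] value-initialising a new entry to (0, 0), so a first index of 0 means "not seen yet".

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Value stored per distinct row code:
// (1-based index of first occurrence, number of occurrences).
using Pair = std::pair<int, int>;

// Hypothetical helper (not from the answer): one pass over the row codes.
std::unordered_map<int, Pair> count_hashes(const std::vector<int>& hashes) {
    std::unordered_map<int, Pair> map_counts;
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        Pair& p = map_counts[hashes[i]];  // new entries start as (0, 0)
        if (p.first == 0) {
            p.first = static_cast<int>(i) + 1;  // record first occurrence, 1-based
        }
        ++p.second;
    }
    return map_counts;
}
```

Using 1-based indices is what makes 0 usable as the sentinel here; with 0-based indices the first row would be indistinguishable from an unseen one.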

Then I just have to traverse the hash map and collect the results. The results don't come out in ascending order of idx, but sorting them is easy if we really want that.
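
If ordered output is wanted, one possible approach (an assumption, not something the answer implements) is to copy the (index, count) pairs out of the map and sort them. The helper name sorted_results is hypothetical. Since every first-occurrence index is unique, the default pair ordering sorts by idx.

```cpp
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not from the answer): flatten the map's
// (first index, count) pairs and order them by first index.
std::vector<std::pair<int, int>> sorted_results(
        const std::unordered_map<int, std::pair<int, int>>& map_counts) {
    std::vector<std::pair<int, int>> out;
    out.reserve(map_counts.size());
    for (const auto& kv : map_counts) {
        out.push_back(kv.second);  // drop the key, keep (index, count)
    }
    std::sort(out.begin(), out.end());  // pairs compare by .first, then .second
    return out;
}
```

The sort costs O(u log u) for u unique rows, which is at most 2^20 with 20 columns, so it should be small next to the pass over all rows.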

I get these results for n=1e5 and n=1e7.

> m <- matrix(sample(0:1, 1e+05, TRUE), ncol = 10)

> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: microseconds
           expr      min       lq    median        uq       max neval
   rowCounts(m) 1194.536 1201.273 1213.1450 1231.7295  1286.458   100
  rowCountsR(m)  575.004  933.637  962.8720  981.6015 23678.451   100
 rowCounts_2(m)  421.744  429.118  442.5095  455.2510   530.261   100

> m <- matrix(sample(0:1, 1e+07, TRUE), ncol = 10)

> microbenchmark(rowCounts(m), rowCountsR(m), rowCounts_2(m))
Unit: milliseconds
           expr      min       lq   median        uq       max neval
   rowCounts(m) 97.22727 98.02716 98.56641 100.42262 102.07661   100
  rowCountsR(m) 57.44635 59.46188 69.34481  73.89541 100.43032   100
 rowCounts_2(m) 22.95741 23.38186 23.78068  24.16814  27.44125   100

Taking advantage of threading helps further. Below is how the time is split between 4 threads on my machine. See the code in this gist.

Here are benchmarks with the last version too:

> microbenchmark(rowCountsR(m), rowCounts_1(m), rowCounts_2(m), rowCounts_3(m,4))
Unit: milliseconds
              expr       min        lq    median        uq       max neval
     rowCountsR(m)  93.67895 127.58762 127.81847 128.03472 151.54455   100
    rowCounts_1(m) 120.47675 120.89169 121.31227 122.86422 137.86543   100
    rowCounts_2(m)  28.88102  29.68101  29.83790  29.97112  38.14453   100
 rowCounts_3(m, 4)  12.50059  12.68981  12.87712  13.10425  17.21966   100
