Finding a duplicate in an HDF5 PyTable with 500e6 rows



Problem

I have a large (> 500e6 rows) dataset that I've put into a pytables database.

Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination has to be unique. There is one non-unique row amongst the 500e6 rows that I'm trying to find.
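
(For reference, a minimal sketch of how such a table could be declared in PyTables; the file name, node name, and the extra hash column used by the solution further down are assumptions for illustration, not part of the question.)

import tables as tb

class Pairs(tb.IsDescription):
    id = tb.UInt32Col()      # ID; values stay below 2**16 in this data set
    counts = tb.UInt32Col()  # per-ID counter; also below 2**16
    hash = tb.Int64Col()     # filled later with id * 2**16 + counts (see the solution below)

h5 = tb.open_file("pairs.h5", mode="w")
db = h5.create_table("/", "pairs", Pairs, expectedrows=500_000_000)
th = db   # the solution snippets below refer to the same table as 'th'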

As a starter I've done something like this:

index1 = db.cols.id.create_index()
index2 = db.cols.counts.create_index()
for row in db:
    # Count how many rows share this row's (id, counts) pair.
    query = '(id == %d) & (counts == %d)' % (row['id'], row['counts'])
    result = db.read_where(query)
    if len(result) > 1:
        print(row[:])

It's a brute-force method, I'll admit. Any suggestions for improvements?

Update

The current brute-force runtime is 8421 minutes.

Solution: Thanks for the input, everyone. I managed to get the runtime down to 2364.7 seconds using the following method:

# Pack (id, counts) into a single integer: hash = id * 2**16 + counts.
ex = tb.Expr('(x * 65536) + y', uservars={"x": th.cols.id, "y": th.cols.counts})
ex.set_output(th.cols.hash)   # write the packed values into the existing 'hash' column
ex.eval()
# Build a completely sorted index (CSI) on the hash column;
# 'filters' is a tb.Filters instance defined elsewhere.
indexrows = th.cols.hash.create_csindex(filters=filters)

ref = None
dups = []
# Neighbor test on the index-sorted hashes: equal consecutive values are duplicates.
for row in th.itersorted(sortby=th.cols.hash):
    if row['hash'] == ref:
        dups.append(row['hash'])
    ref = row['hash']

# Unpack the duplicates: id is in the high 16 bits, counts in the low 16 bits.
print("ids: ", np.right_shift(np.array(dups, dtype=np.int64), 16))
print("counts: ", np.array(dups, dtype=np.int64) & (65536 - 1))

I can generate a perfect hash because my maximum values are less than 2^16. I am effectively bit-packing the two columns into a 32-bit int.
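
For illustration, a minimal sketch of that packing scheme in NumPy (the pack/unpack helper names and the sample values are made up; as stated above, it assumes both columns stay below 2^16):

import numpy as np

def pack(ids, counts):
    # Lossless as long as counts < 2**16: high 16 bits hold id, low 16 bits hold counts.
    return (np.asarray(ids, dtype=np.int64) << 16) | np.asarray(counts, dtype=np.int64)

def unpack(hashes):
    h = np.asarray(hashes, dtype=np.int64)
    return h >> 16, h & 0xFFFF

h = pack([3, 3, 42], [7, 7, 1])
print(h)          # [ 196615  196615 2752513]
print(unpack(h))  # (array([ 3,  3, 42]), array([7, 7, 1]))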

Once the csindex is generated it is fairly trivial to iterate over the sorted values and do a neighbor test for duplicates.

This method can probably be tweaked a bit, but I'm testing a few alternatives that may provide a more natural solution.

Solution

Two obvious techniques come to mind: hashing and sorting.

A) Define a hash function that combines ID and Counter into a single, compact value.

B) Count how often each hash code occurs.

C) Select from your data everything that has hash collisions (this should be a "much" smaller data set).

D) Sort this data set to find duplicates.

The hash function in A) needs to be chosen such that it fits into main memory while still providing enough selectivity. Maybe use two bitsets of around 2^30 entries for this. You can afford 5-10% collisions; that should still reduce the data set enough to allow fast in-memory sorting afterwards.

This is essentially a Bloom filter.
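
For what it's worth, here is a rough, untested sketch of that two-pass bitset scheme in NumPy, reusing the packed 'hash' column from the solution above. The bitset size, the multiplicative mixing constant, the chunk size, and all helper names (buckets, mark, is_candidate) are illustrative choices, not part of this answer:

import numpy as np

NBITS = 1 << 30                               # ~10^9 buckets; 128 MiB per bitset
seen = np.zeros(NBITS // 8, dtype=np.uint8)   # bucket hit at least once
dup = np.zeros(NBITS // 8, dtype=np.uint8)    # bucket hit two or more times

def buckets(keys):
    # Multiplicative mixing of the packed key, keeping the top 30 bits as the bucket index.
    h = np.asarray(keys, dtype=np.uint64) * np.uint64(0x9E3779B97F4A7C15)
    return (h >> np.uint64(34)).astype(np.int64)

def mark(keys):
    # Pass 1: a bucket goes into 'dup' if it was seen in an earlier chunk
    # or occurs more than once inside this chunk.
    b, counts = np.unique(buckets(keys), return_counts=True)
    byte, bit = b >> 3, np.uint8(1) << (b & 7).astype(np.uint8)
    collided = ((seen[byte] & bit) != 0) | (counts > 1)
    np.bitwise_or.at(dup, byte[collided], bit[collided])
    np.bitwise_or.at(seen, byte, bit)

def is_candidate(keys):
    # Pass 2: keep only keys whose bucket collided; these are the rows worth sorting.
    b = buckets(keys)
    byte, bit = b >> 3, np.uint8(1) << (b & 7).astype(np.uint8)
    return (dup[byte] & bit) != 0

CHUNK = 1_000_000   # 'th' is the open table with the packed 'hash' column from above

# Pass 1: populate the bitsets.
for start in range(0, int(th.nrows), CHUNK):
    mark(th.cols.hash[start:min(start + CHUNK, int(th.nrows))])

# Pass 2: collect collision candidates, then sort the (much smaller) set in memory.
cands = []
for start in range(0, int(th.nrows), CHUNK):
    keys = th.cols.hash[start:min(start + CHUNK, int(th.nrows))]
    cands.append(keys[is_candidate(keys)])
cands = np.sort(np.concatenate(cands))
dup_keys = np.unique(cands[1:][cands[1:] == cands[:-1]])   # neighbor test, as before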
