Pytables duplicates 2.5 giga rows


Problem description


I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column which is the xxhash of the text. The table consists of ~2.5e9 rows.
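For reference, a minimal sketch of what such a table description could look like in PyTables (the column names password, src and xhashx are taken from the code below; the exact declaration in the original file may differ):

import tables as tb

class PasswordRecord(tb.IsDescription):
    password = tb.StringCol(64)   # 64-character text entry
    src      = tb.UInt32Col()     # source of the text (used as a bitmask in the code below)
    xhashx   = tb.UInt32Col()     # xxhash of the text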


I am trying to find and count the duplicates of each text entry in the table - essentially merge them into one entry, while counting the instances. I have tried doing so by indexing on the hash column and then looping through table.itersorted(hash), while keeping track of the hash value and checking for collisions - very similar to finding a duplicate in a hdf5 pytable with 500e6 rows. I did not modify the table as I was looping through it but rather wrote the merged entries to a new table - I am putting the code at the bottom.


Basically the problem I have is that the whole process takes far too long - it took me about 20 hours to get to iteration #5 4e5. I am working on an HDD, however, so it is entirely possible that the bottleneck is there. Do you see any way I can improve my code, or can you suggest another approach? Thank you in advance for any help.


P.S. I promise I am not doing anything illegal, it is simply a large scale leaked password analysis for my Bachelor Thesis.

ref = 3       # manually checked first occurring hash, to simplify the below code
gen_cnt = 0   # counter, so as not to flush after every iteration
locs = {}     # password -> [count, OR-combined source mask] for the current hash


print("STARTING")
for row in table.itersorted('xhashx'):
    gen_cnt += 1
    ps = row['password'].decode(encoding='utf-8', errors='ignore')

    if row['xhashx'] == ref:
        if ps in locs:
            locs[ps][0] += 1
            locs[ps][1] |= row['src']

        else:
            locs[ps] = [1, row['src']]

    else:
        # hash changed: write out the merged entries collected for the previous hash
        for p in locs:
            fill_password(new_password, locs[p])  # simply fills in the columns, with some fairly cheap statistics procedures
            new_password.append()

        if gen_cnt > 100:
            gen_cnt = 0
            new_table.flush()

        # start collecting for the new hash value
        locs = {ps: [1, row['src']]}
        ref = row['xhashx']


Answer


Your dataset is 10x larger than the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly and might be resource intensive. (I don't have any experience with itersorted.)
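One way to do that testing is to time a bounded sample of the sorted iteration before committing to a full pass; a minimal sketch (the sample size of 1e6 rows is an arbitrary choice for illustration):

import time
from itertools import islice

n_sample = 1_000_000                     # arbitrary sample size for the timing test
start = time.perf_counter()
for row in islice(table.itersorted('xhashx'), n_sample):
    pass                                 # measure pure iteration cost, do no work
elapsed = time.perf_counter() - start
print(f"{n_sample} sorted rows in {elapsed:.1f} s; "
      f"projected full pass: {elapsed / n_sample * 2.5e9 / 3600:.1f} h")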


Here is a process that might be faster:

  1. Extract a NumPy array of the hash field (column xhashx)
  2. Find the unique hash values
  3. Loop through the unique hash values and extract a NumPy array of the rows matching each value
  4. Test the rows in this extracted array for uniqueness
  5. Write the unique rows to a new file


Code for this process is below.
Note: this has not been tested, so it may have small syntax or logic gaps.

import numpy as np

# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')
# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single step:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:

    # Step 3b: Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')

    # Step 4: Check for rows with unique values
    # Check the hash row count.
    # If there is only 1 row, no uniqueness test is required
    if match_row_arr.shape[0] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write unique rows to new.table

##################################################
# np.unique has an option to return the hash counts;
# these can be used as a test in the loop
(hash_arr_u, hash_cnts) = np.unique(table.read(field='xhashx'), return_counts=True)

# Loop on rows in the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    hash_test = hash_arr_u[cnt]

    # Get an array with all rows that match this unique hash value
    match_row_arr = table.read_where('xhashx == hash_test')

    # Check the hash row count.
    # If there is only 1 row, no uniqueness test is required
    if hash_cnts[cnt] == 1:
        pass  # only one row, so write it to new.table
    else:
        pass  # check for unique rows, then write unique rows to new.table
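The "check for unique rows" branch is left open above. A minimal sketch of one way to fill it in, assuming the new table reuses the password/src/xhashx layout plus a hypothetical count column, and that duplicates within a hash bucket are resolved on the password field:

import numpy as np

def write_unique(match_row_arr, new_table):
    # Group the rows of one hash bucket by their password text
    pwds, first_idx, counts = np.unique(
        match_row_arr['password'], return_index=True, return_counts=True)
    row = new_table.row
    for pwd, idx, cnt in zip(pwds, first_idx, counts):
        mask = match_row_arr['password'] == pwd
        row['password'] = pwd
        row['src'] = np.bitwise_or.reduce(match_row_arr['src'][mask])  # OR-combine sources, as in the question
        row['xhashx'] = match_row_arr['xhashx'][idx]
        row['count'] = cnt          # hypothetical count column in the new table
        row.append()
    new_table.flush()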

