Find duplicate records in large text file

Problem description

I'm on a linux machine (Redhat) and I have an 11GB text file. Each line in the text file contains data for a single record and the first n characters of the line contains a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file so any solution that requires loading the entire file into memory would not be practical.

Recommended answer

Read the file line-by-line, so you don't have to load it all into memory.

For each line (record) create a sha256 hash (32 bytes), unless your identifier is shorter.

Store the hashes/identifiers in a numpy.array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of a decent machine these days.
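The snippet further down only shows the bucketed defaultdict variant, so here is a minimal sketch of the numpy.array route, assuming Python 3 with numpy installed and the same placeholder file name 'bigdata.txt'. Note that np.unique sorts the whole array, so it needs additional working memory on top of the ~864 MB of digests.

import hashlib
import numpy as np

# Collect one 32-byte SHA-256 digest per line into a flat byte buffer.
digests = bytearray()
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        digests += hashlib.sha256(line).digest()  # or hash only line[:n], the identifier

# View the buffer as one row per record: 27 million rows * 32 bytes is roughly 864 MB.
arr = np.frombuffer(digests, dtype=np.uint8).reshape(-1, 32)

# np.unique over rows reports how often each digest occurs.
_, counts = np.unique(arr, axis=0, return_counts=True)
print("identifiers that occur more than once:", int((counts > 1).sum()))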

To speed up access you could use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of the hash in a list as the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain on average a list of around 400 entries. It would mean faster searching than a numpy array, but it would use more memory.

import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:  # read bytes so hashlib accepts the lines
    for line in datafile:
        id = hashlib.sha256(line).digest()
        # Or id = line[:n], if the first n characters already form the identifier
        k = id[0:2]   # first 2 bytes select one of 65536 buckets
        v = id[2:]    # remaining bytes are stored in the bucket
        if v in d[k]:
            print("duplicate found:", id.hex())
        else:
            d[k].append(v)
