Comparing large files with grep or python


Question

I have two lists of URLs and I want to find the strings that are new. Example:

listA.txt
string1
string2

listB.txt
string1
string3

Then I compare the two lists to find the strings that are new in list B, using either of these commands:

grep -w -f listA.txt -v listB.txt

cat listA.txt | grep -Fxvf - listB.txt

Final result:

string3

The problem is that I have millions of strings, so running the command consumes all of my PC's resources and the machine crashes.

Is there a way to do this with Python that uses fewer resources and runs faster?

Thanks

Recommended answer

This method builds a set from the first file (listA.txt). The main memory requirement is enough space to hold that set. It then iterates over the URLs in listB.txt one line at a time, so the second file is never loaded into memory in full. If a URL is not in the set, it is written to a new file, which is also written line by line.

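filename_1 = 'listA.txt'  # known URLs
filename_2 = 'listB.txt'  # URLs to check
filename_3 = 'listC.txt'  # output: URLs that appear only in listB
with open(filename_1, 'r') as f1, open(filename_2, 'r') as f2, open(filename_3, 'w') as fout:
    # Iterate over f1 directly; readlines() would also hold a full list of lines in memory.
    s = set(val.strip() for val in f1)
    # Stream listB line by line and keep only the URLs that are not in the set.
    for row in f2:
        row = row.strip()
        if row not in s:
            fout.write(row + '\n')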
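For reuse, the same set-based approach could be wrapped in a function. The following is only a minimal sketch: the function name write_new_lines and its parameters are hypothetical, not part of the original answer, and the demo at the bottom just reproduces the listA.txt / listB.txt example from the question inside a temporary directory.

```python
import os
import tempfile


def write_new_lines(known_path, candidate_path, out_path):
    """Write the lines of candidate_path that do not occur in known_path.

    Hypothetical helper: returns the number of lines written.
    """
    with open(known_path, 'r') as known:
        # Only this set is held in memory.
        seen = {line.strip() for line in known}

    written = 0
    with open(candidate_path, 'r') as candidates, open(out_path, 'w') as out:
        # Stream the second file line by line.
        for line in candidates:
            line = line.strip()
            if line not in seen:
                out.write(line + '\n')
                written += 1
    return written


if __name__ == '__main__':
    # Reproduce the example from the question with throwaway files.
    with tempfile.TemporaryDirectory() as tmp:
        list_a = os.path.join(tmp, 'listA.txt')
        list_b = os.path.join(tmp, 'listB.txt')
        list_c = os.path.join(tmp, 'listC.txt')
        with open(list_a, 'w') as f:
            f.write('string1\nstring2\n')
        with open(list_b, 'w') as f:
            f.write('string1\nstring3\n')

        count = write_new_lines(list_a, list_b, list_c)
        with open(list_c, 'r') as f:
            print(count, f.read().strip())  # prints: 1 string3
```

Memory use still grows with the size of the first file, so this only helps as long as the set built from listA.txt fits in RAM.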
