See if two files have the same content in Python


Question

Possible duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files have the same content in Python?

One thing I can do is compute the MD5 hash of each file and compare the hashes. Is there a better way?

Recommended answer

Yes, I think hashing the files is the best approach if you have to compare several files and store the hashes for later comparison. Because hashes can collide, a byte-by-byte comparison may still be needed, depending on the use case.
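As a rough illustration of the "hash several files and keep the hashes" idea, here is a minimal sketch. The helper names `file_digest` and `group_by_digest`, the chunk size, and the temporary file names are assumptions made for this example, and SHA-256 is swapped in for the MD5 mentioned in the question. Files are read in chunks so that large files do not have to fit in memory.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(paths):
    """Map each digest to the list of paths that produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return dict(groups)


# Demo with three small temporary files (hypothetical names and contents).
tmpdir = tempfile.mkdtemp()
paths = []
for name, data in [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")]:
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as f:
        f.write(data)
    paths.append(path)

groups = group_by_digest(paths)
# Keep only the groups in which more than one file shares a digest.
duplicates = [
    sorted(os.path.basename(p) for p in group)
    for group in groups.values()
    if len(group) > 1
]
print(duplicates)  # [['a.txt', 'b.txt']]
```

Files that land in the same group are only candidates for being identical; if collisions matter for the use case, each group can be confirmed with a byte-by-byte comparison.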

Generally, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does this, among other things.

See http://docs.python.org/library/filecmp.html, e.g.

>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
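One detail worth knowing: by default `filecmp.cmp` uses `shallow=True`, which treats two files as equal when their `os.stat()` signatures (file type, size, and modification time) match, without reading the contents. Passing `shallow=False` asks for a comparison of the actual bytes. The sketch below uses temporary files with made-up contents to show this.

```python
import filecmp
import os
import tempfile

# Create two temporary files of the same size but different content.
tmpdir = tempfile.mkdtemp()
path1 = os.path.join(tmpdir, "file1.txt")
path2 = os.path.join(tmpdir, "file2.txt")

with open(path1, "wb") as f:
    f.write(b"same size, content A")
with open(path2, "wb") as f:
    f.write(b"same size, content B")

# shallow=False forces a comparison of the file contents.
same_content = filecmp.cmp(path1, path2, shallow=False)
identical_to_self = filecmp.cmp(path1, path1, shallow=False)
print(same_content, identical_to_self)  # False True
```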

Speed consideration: usually, if only two files have to be compared, hashing both and comparing the hashes is slower than a simple byte-by-byte comparison, provided the comparison is done efficiently. For example, the code below tries to time hashing against byte-by-byte comparison.

Disclaimer: this is not the best way of timing or comparing two algorithms, and it needs improvement, but it does give a rough idea. If you think it should be improved, tell me and I will change it.

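import hashlib
import random
import string
import time

def getRandText(N):
    # Build a random string of N printable characters.
    return "".join(random.choice(string.printable) for i in range(N))

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # Hash both texts with MD5 and compare the digests.
    # In Python 3, hashlib works on bytes, so the text is encoded first.
    hash1 = hashlib.md5()
    hash1.update(text1.encode("utf-8"))
    hash1 = hash1.hexdigest()

    hash2 = hashlib.md5()
    hash2.update(text2.encode("utf-8"))
    hash2 = hash2.hexdigest()

    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # Direct comparison; it can stop at the first difference.
    return text1 == text2

# Time 10 calls of each comparison function.
for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print(cmpFunc.__name__, time.time() - st)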

The output is:

cmpHash 0.234999895096
cmpByteByByte 0.0
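The timing above compares in-memory strings. For files on disk, a byte-by-byte comparison "done efficiently" could look something like the sketch below: check the sizes first, then read both files in chunks and stop at the first difference. The function name `files_equal`, the chunk size, and the demo file names are assumptions for illustration; `filecmp.cmp(..., shallow=False)` from the standard library serves the same purpose.

```python
import os
import tempfile


def files_equal(path1, path2, chunk_size=65536):
    """Compare two files chunk by chunk, stopping at the first difference."""
    # Files of different sizes cannot have the same content.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files are exhausted
                return True


# Demo: two identical files and one that differs only in its last byte.
tmpdir = tempfile.mkdtemp()
contents = {
    "a.bin": b"x" * 100000 + b"a",
    "b.bin": b"x" * 100000 + b"a",
    "c.bin": b"x" * 100000 + b"b",
}
for name, data in contents.items():
    with open(os.path.join(tmpdir, name), "wb") as f:
        f.write(data)

path_a, path_b, path_c = (os.path.join(tmpdir, n) for n in ("a.bin", "b.bin", "c.bin"))
print(files_equal(path_a, path_b), files_equal(path_a, path_c))  # True False
```

Unlike hashing, this never has to read either file to the end once a difference is found, which is why it tends to win when only two files are involved.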
