See if two files have the same content in Python
Question
Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?
What is the easiest way to see if two files are the same content-wise in Python?
One thing I can do is compute the MD5 of each file and compare the results. Is there a better way?
Recommended answer
Yes, I think hashing the files is the best way if you have to compare several files and want to store the hashes for later comparison. Since hashes can collide, a byte-by-byte comparison may still be needed afterwards, depending on the use case.
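As a rough illustration of the hash-and-store approach, here is a minimal sketch. The helper names `file_digest` and `group_by_digest` and the chunk size are assumptions made for this example, and it uses SHA-256 rather than the MD5 mentioned in the question. Files are read in chunks so large files do not have to fit in memory.

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by_digest(paths):
    """Map each digest to the list of paths whose content produced it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return dict(groups)
```

Any group with more than one path is a set of candidate duplicates. If a collision would matter for your use case, those candidates can then be confirmed with a byte-by-byte comparison.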
Generally, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does that, among other things.
See http://docs.python.org/library/filecmp.html, for example:
>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
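One detail worth knowing: by default `filecmp.cmp` runs with `shallow=True`, in which case two files whose `os.stat()` signatures (file type, size and modification time) are identical are taken to be equal without their contents being read. To force a content comparison, pass `shallow=False`. The sketch below shows this on three temporary files; the file names and contents are made up for the example.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    # a and b hold the same bytes; c has the same size but different content
    for path, data in ((a, b"hello\n"), (b, b"hello\n"), (c, b"jello\n")):
        with open(path, "wb") as f:
            f.write(data)

    # shallow=False makes filecmp read and compare the file contents
    same_ab = filecmp.cmp(a, b, shallow=False)
    same_ac = filecmp.cmp(a, c, shallow=False)

print(same_ab, same_ac)  # True False
```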
Speed consideration: if only two files have to be compared, hashing both and comparing the digests is usually slower than a simple byte-by-byte comparison, provided the latter is done efficiently. Hashing always has to read both inputs in full, whereas a direct comparison can stop at the first difference. For example, the code below tries to time hashing against byte-by-byte comparison.
Disclaimer: this is not the best way of timing or comparing the two algorithms, and it needs improvement, but it does give a rough idea. If you think it should be improved, tell me and I will change it.
import hashlib
import random
import string
import time

def getRandText(N):
    # N random printable ASCII characters, encoded to bytes because hashlib needs bytes in Python 3
    return "".join(random.choice(string.printable) for i in range(N)).encode("ascii")

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # Hash each input in full, then compare the hex digests
    hash1 = hashlib.md5()
    hash1.update(text1)
    hash1 = hash1.hexdigest()
    hash2 = hashlib.md5()
    hash2.update(text2)
    hash2 = hash2.hexdigest()
    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # Direct comparison of the two byte strings
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print(cmpFunc.__name__, time.time() - st)
Output:
cmpHash 0.234999895096
cmpByteByByte 0.0
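The 0.0 reported for the direct comparison most likely reflects two things: two independently generated random strings almost always differ within the first character or two, so the comparison can stop immediately, and `time.time()` may be too coarse to measure such a short interval. A somewhat fairer sketch, under the assumption that the worst case for a direct comparison is two equal inputs, uses `timeit` on equal but distinct byte strings. The function names here are made up for the example, and SHA-256 is used in place of MD5. No timings are shown because they depend on the machine.

```python
import hashlib
import os
import timeit

N = 1000000
data1 = os.urandom(N)
# Equal content in a separate object, so == cannot shortcut on identity
data2 = bytes(bytearray(data1))

def cmp_hash(b1, b2):
    # Hash both inputs in full, then compare the digests
    return hashlib.sha256(b1).digest() == hashlib.sha256(b2).digest()

def cmp_direct(b1, b2):
    # Direct comparison; for equal inputs every byte has to be examined
    return b1 == b2

results = {}
for func in (cmp_hash, cmp_direct):
    # Total time for 10 calls, measured with timeit's high-resolution clock
    results[func.__name__] = timeit.timeit(lambda: func(data1, data2), number=10)
    print(func.__name__, results[func.__name__])
```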