Comparing two large files
Problem description
I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13,464,448 lines, check whether a grep returns true on another file, and then write the result to a third file. I wrote a quick test with about 1,000,000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.
Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on doing it in bash or Python.
Thanks a lot in advance.
Edit: Sorry, when I say difference between two files I did not mean a diff. The result file is in a different format.
The logic goes something like this:
File A has 297,599 lines. File B has over 13 million lines.
I pick the current line being read from File A, grep for it in File B, and if the line is present in File B I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.
Edit: I was asked at work to create a bash solution, ideally so that we don't have to install Python on all the machines this has to run on.
Here is my current implementation:
#!/bin/bash
LAST_TTP=$(ls -ltr TTP_*.txt | tail -1 | awk '{ print $9 }')
LAST_EXP=$(ls -ltr *.SSMT | tail -1 | awk '{ print $9 }')
while read -r line; do
    MATCH="$(grep -- "$line" "$LAST_EXP")"
    echo "line: $line, match: $MATCH"
    # if the match is not empty, record it
    if [ -n "$MATCH" ]; then
        echo "$MATCH" >> result
    fi
done < "$LAST_TTP"
This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?
Thanks a lot in advance!
Answer
You're probably looking in a list instead of a set, leading to O(n²) performance. Try:
# load every line of File B into a set for O(1) membership tests
with open('b') as b:
    blines = set(b)
with open('a') as a:
    with open('result', 'w') as result:
        # keep the File A lines that also occur in File B
        for line in a:
            if line in blines:
                result.write(line)
Assuming uniformly long (and not overly long) lines, the performance of this implementation is in O(|A| + |B|) (amortized, since membership tests in Python's set are extremely fast; see http://wiki.python.org/moin/TimeComplexity#set). The memory demand is in O(|B|), but with a factor significantly greater than 1.
If the order of lines in the output does not matter, you can also sort both files and then compare them line by line. This will have a performance in the order of O(|A| log |A| + |B| log |B|). The memory demand will be in O(|A| + |B|), or more precisely, |A| + |B|.
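A shell-only version of this sort-and-compare idea can be sketched with sort and comm, again under the assumption that matching lines are byte-identical (the file names and demo data are placeholders). comm -12 suppresses the lines unique to each sorted input and prints only the lines common to both:

```shell
#!/bin/sh
# Demo data standing in for the real files (placeholder names).
printf 'beta\ndelta\n'        > file_a
printf 'alpha\nbeta\ngamma\n' > file_b

# comm requires sorted input; -1 and -2 suppress the lines unique
# to the first and second file, leaving only the common lines.
sort file_a > a.sorted
sort file_b > b.sorted
comm -12 a.sorted b.sorted > result

cat result   # -> beta
```

Note that the result comes out in sorted order rather than File A's original order, which is why this variant only applies when the output order does not matter.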