Comparing two large files

Problem description

I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13.464.448 lines, check if a grep returns true on another file and then write the result to another file. I wrote a quick test with about 1.000.000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.

Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on doing it in bash or Python.

Thanks a lot in advance.

Sorry, when I say difference between two files I did not mean a diff. The result file is in a different format.

The logic is somewhat like this:

File A has 297.599 lines.
File B has over 13 million lines.

I pick the current line being read from File A, grep for it in File B, and if the line is present in File B, I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.

I was asked at work to create a bash solution, ideally, so that we don't have to install Python on all the machines this has to run on.

This is my current implementation:

#!/bin/bash

# Pick the most recently modified TTP and SSMT files.
LAST_TTP=$(ls -tr TTP_*.txt | tail -1)
LAST_EXP=$(ls -tr *.SSMT | tail -1)

while read -r line; do
   # Quote "$line" so whitespace and glob characters in it do not break grep.
   MATCH="$(grep "$line" "$LAST_EXP")"
   echo "line: $line, match: $MATCH"

   # if not empty
   if [ -n "$MATCH" ]
   then
      echo "$MATCH" >> result
   fi

done < "$LAST_TTP"

This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?
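
A faster pure-bash direction worth sketching here: grep can read all of its patterns from a file at once with -f, and -F makes it treat them as fixed strings, so the whole job becomes a single pass over File B instead of one grep invocation per line of File A. A minimal sketch, reusing the LAST_TTP and LAST_EXP variables from the script above:

# Use every line of the TTP file as a fixed-string pattern and scan the
# SSMT file once; matching lines from the SSMT file are written to result,
# just like the $MATCH lines in the loop above.
grep -F -f "$LAST_TTP" "$LAST_EXP" > result

Add -x if a pattern should only count as a match when it equals a whole line of the SSMT file rather than a substring of one.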

Thanks a lot in advance!

Recommended answer

You're probably looking in a list instead of a set, leading to O(n²) performance. Try:

with open('b') as b:
  blines = set(b)  # load all of File B's lines into a set: O(1) average lookups
with open('a') as a:
  with open('result', 'w') as result:
    for line in a:
      if line not in blines:  # use "if line in blines" to keep the lines that ARE in b
        result.write(line)

Assuming uniformly long (and not overly long) lines, the performance of this implementation is in O(|A| + |B|) (amortized, since membership tests on Python's set are extremely fast; see http://wiki.python.org/moin/TimeComplexity). The memory demand is in O(|B|), but with a factor significantly greater than 1.

If the order of lines in the output does not matter, you can also sort both files and then compare them line by line. This will have performance in the order of O(|A| log |A| + |B| log |B|). The memory demand will be in O(|A| + |B|), or more precisely, |A| + |B|.
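
For reference, a minimal sketch of that sort-then-compare approach using standard Unix tools, assuming whole lines can be compared exactly (the input names a and b are placeholders, matching the answer's code above):

sort a > a.sorted
sort b > b.sorted

# comm -12 prints only the lines common to both sorted inputs;
# comm -23 would print the lines that appear only in a (the "difference").
comm -12 a.sorted b.sorted > result

GNU sort performs an external merge sort and spills to temporary files, so this variant also works when the inputs do not fit in memory.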
