Comparing two large files

Problem description

I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13.464.448 lines, check if a grep returns true on another file and then write the result to another file. I wrote a quick test with about 1.000.000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.

Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on doing it in bash or Python.

Thanks a lot in advance.

Sorry, when I say difference between two files I did not mean a diff. The result file is in a different format.

The logic is somewhat like this:

File A has 297.599 lines.
File B has over 13 million lines.

I pick the current line being read from File A, grep for it in File B, and if the line is present in File B, I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.

I was asked at work to create a bash solution, ideally, so that we don't have to install Python on all the machines this has to run on.

This is my current implementation:

#!/bin/bash

# Pick the most recently modified TTP and SSMT files.
LAST_TTP=$(ls -tr TTP_*.txt | tail -1)
LAST_EXP=$(ls -tr *.SSMT | tail -1)

while read -r line; do
   # Quote "$line" so whitespace and glob characters in it do not break grep.
   MATCH="$(grep "$line" "$LAST_EXP")"
   echo "line: $line, match: $MATCH"

   # if not empty
   if [ -n "$MATCH" ]
   then
      echo "$MATCH" >> result
   fi

done < "$LAST_TTP"

This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?
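
A faster pure-bash direction worth sketching here: grep can read all of its patterns from a file at once with -f, and -F makes it treat them as fixed strings, so the whole job becomes a single pass over File B instead of one grep invocation per line of File A. A minimal sketch, reusing the LAST_TTP and LAST_EXP variables from the script above:

# Use every line of the TTP file as a fixed-string pattern and scan the
# SSMT file once; matching lines from the SSMT file are written to result,
# just like the $MATCH lines in the loop above.
grep -F -f "$LAST_TTP" "$LAST_EXP" > result

Add -x if a pattern should only count as a match when it equals a whole line of the SSMT file rather than a substring of one.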

Thanks a lot in advance!

Recommended answer

You're probably looking in a list instead of a set, leading to O(n²) performance. Try:

with open('b') as b:
  blines = set(b)  # load all of File B's lines into a set: O(1) average lookups
with open('a') as a:
  with open('result', 'w') as result:
    for line in a:
      if line not in blines:  # use "if line in blines" to keep the lines that ARE in b
        result.write(line)

Assuming uniformly long (and not overly long) lines, the performance of this implementation is in O(|A| + |B|) (amortized, since membership tests on Python's set are extremely fast; see http://wiki.python.org/moin/TimeComplexity). The memory demand is in O(|B|), but with a factor significantly greater than 1.

If the order of lines in the output does not matter, you can also sort both files and then compare them line by line. This will have performance in the order of O(|A| log |A| + |B| log |B|). The memory demand will be in O(|A| + |B|), or more precisely, |A| + |B|.
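
For reference, a minimal sketch of that sort-then-compare approach using standard Unix tools, assuming whole lines can be compared exactly (the input names a and b are placeholders, matching the answer's code above):

sort a > a.sorted
sort b > b.sorted

# comm -12 prints only the lines common to both sorted inputs;
# comm -23 would print the lines that appear only in a (the "difference").
comm -12 a.sorted b.sorted > result

GNU sort performs an external merge sort and spills to temporary files, so this variant also works when the inputs do not fit in memory.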
