How to compare large text files?


Problem Description



I have a general question and would like your opinion on my "technique".

There are two text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, 30,000,000 to 45,000,000 lines each). My idea is to read as many lines of file_1 into memory as possible, then compare those to all lines of file_2. If there's a match, the matching lines from both files shall be written to a new file. Then I go on with the next 1000 lines of file_1 and again compare those to all lines of file_2, until I have gone through file_1 completely.

But this actually sounds really, really time-consuming and complicated to me. Can you think of any other method to compare these two files?

How long do you think the comparison could take? For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...

Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it? I want to read as many as possible (because I think that's faster), but I've run out of memory quite often.

Thanks in advance.

EDIT I think I have to explain my problem a bit more.

The purpose is not to see if the two files in general are identical (they are not). There are some lines in each file that share the same "characteristic". Here's an example: file_1 looks somewhat like this:

mat1 1000 2000 TEXT      //this means the range is from 1000 to 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT

file_2 looks like this:

mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT

TEXT refers to characters and digits that are of no interest to me, mat can go from mat1 to mat50 and is in no particular order; there can also be 1000x mat2 (but the numbers in the next column are different). I need to find the matching lines in such a way that matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1. So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010). I hope this makes it clear to you!
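
For concreteness, the rule described above could be written as a small Java check like the sketch below. The helper name and the assumption of whitespace-separated columns are mine, not part of the original question:

// Matching rule: same matX in both lines, and the number in the
// file_2 line falls inside the range given by the file_1 line.
static boolean matches(String file1Line, String file2Line) {
    String[] a = file1Line.split("\\s+");    // [matX, lower, upper, TEXT...]
    String[] b = file2Line.split("\\s+");    // [matX, number, TEXT...]
    if (!a[0].equals(b[0])) {
        return false;                        // different matX -> can never match
    }
    long lower = Long.parseLong(a[1]);
    long upper = Long.parseLong(a[2]);
    long value = Long.parseLong(b[1]);
    return value >= lower && value <= upper; // number falls inside the range
}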

So my question is: how would you search for the matching lines?

Yes, I use Java as my programming language.

EDIT I have now divided the huge files first so that I have no problems with running out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-) Nonetheless all your approaches were very helpful to me, thank you for your replies!

Solution

Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.

This should eliminate a substantial number of comparisons that wouldn't match anyway in the naive, brute-force approach. For the sake of argument, let's peg both files at 40 million lines each.

Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
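
A rough Java sketch of such a partitioning pass is shown below; the class name, the output-file naming scheme (prefix + "_" + matX) and the assumption of well-formed, space-separated lines are illustrative choices, not something prescribed by the answer:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class Partitioner {

    // Routes each line of the input file into a per-matX file
    // (e.g. file_1_mat3) based on its first column.
    public static void partition(String inputPath, String prefix) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int space = line.indexOf(' ');
                if (space < 0) continue;                 // skip malformed lines
                String key = line.substring(0, space);   // e.g. "mat3"
                BufferedWriter writer = writers.get(key);
                if (writer == null) {
                    writer = new BufferedWriter(new FileWriter(prefix + "_" + key));
                    writers.put(key, writer);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("file_1", "file_1");   // produces file_1_mat1, file_1_mat2, ...
        partition("file_2", "file_2");   // produces file_2_mat1, file_2_mat2, ...
    }
}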

That's one pass through two files for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.

Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
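
Below is a sketch of that sorting step for a single partition, assuming the roughly 800,000 lines do fit in memory, so a plain in-memory sort stands in for the external merge sort mentioned above; file names and the column layout are again assumptions:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionSorter {

    // Sort one partition file by the numeric value of its second column.
    public static void sortBySecondColumn(String inPath, String outPath) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(Paths.get(inPath)));
        lines.sort(Comparator.comparingLong(
                (String line) -> Long.parseLong(line.split("\\s+")[1])));
        Files.write(Paths.get(outPath), lines);
    }

    public static void main(String[] args) throws IOException {
        // The lower bound in file_1_mat3 and the plain number in file_2_mat3
        // sit in the same column position, so one routine handles both.
        sortBySecondColumn("file_1_mat3", "file_1_mat3.sorted");
        sortBySecondColumn("file_2_mat3", "file_2_mat3.sorted");
    }
}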

Comparison: Now you just have to iterate once through both file_1_mat1 and file_2_mat1, without needing to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
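
One possible shape for that comparison pass over a single pair of sorted partitions is sketched below. Strictly speaking it keeps a small working set of ranges that still overlap the current number in memory (rather than literally nothing); file names and column positions are assumptions carried over from the examples:

import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartitionJoiner {

    // One range line from file_1_matX: "matX lower upper TEXT...".
    private static class Range {
        final long lower, upper;
        final String line;
        Range(String line) {
            String[] cols = line.split("\\s+");
            this.lower = Long.parseLong(cols[1]);
            this.upper = Long.parseLong(cols[2]);
            this.line = line;
        }
    }

    private static Range readRange(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        return line == null ? null : new Range(line);
    }

    // Joins one sorted range partition against one sorted number partition.
    public static void join(String rangesPath, String pointsPath, String outPath) throws IOException {
        try (BufferedReader rangeReader = new BufferedReader(new FileReader(rangesPath));
             BufferedReader pointReader = new BufferedReader(new FileReader(pointsPath));
             BufferedWriter out = new BufferedWriter(new FileWriter(outPath))) {

            List<Range> active = new ArrayList<>();   // ranges overlapping the current number
            Range pending = readRange(rangeReader);
            String pointLine;

            while ((pointLine = pointReader.readLine()) != null) {
                long value = Long.parseLong(pointLine.split("\\s+")[1]);

                // Pull in every range whose lower bound has been reached.
                while (pending != null && pending.lower <= value) {
                    active.add(pending);
                    pending = readRange(rangeReader);
                }

                // Drop ranges that end before the current number; the numbers
                // are sorted ascending, so those ranges can never match again.
                Iterator<Range> it = active.iterator();
                while (it.hasNext()) {
                    if (it.next().upper < value) {
                        it.remove();
                    }
                }

                // Every range still active contains the current number.
                for (Range r : active) {
                    out.write(r.line + " | " + pointLine);
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        join("file_1_mat3.sorted", "file_2_mat3.sorted", "matches_mat3");
    }
}

Wrapping the three steps in a loop over the partitions (mat1 through mat50) would then give the complete pipeline.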

Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
