How to find common strings among two very large files?

Question

I have two very large files, neither of which fits in memory. Each line of each file holds one string, which contains no spaces and is 99, 100, or 101 characters long.

Update: The strings are not in any sorted order.
Update 2: I am working with Java on Windows.

Now I want to figure out the best way to find all the strings that occur in both files.

I have been thinking about using an external merge sort to sort both files and then comparing them, but I am not sure that is the best way to do it. Since the strings are all about the same length, I have also wondered whether computing some kind of hash for each string would be a good idea, since that should make comparisons between strings cheaper. But that would mean storing the hashes of every string I have read so far, so that they can be used later when comparing against other strings. I cannot pin down which approach would be best, so I am looking for your suggestions.

When you suggest a solution, please also state whether it would still work if there were more than two files and the strings that occur in all of them had to be found.

Recommended answer

You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, the standard tools will do it for you.

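# De-duplicate each file on its own, so a string repeated inside one file counts only once
sort file1 | uniq > output
sort file2 | uniq >> output
sort file3 | uniq >> output
...
# Print each string that appears more than once in the combined list, i.e. in at least two files
sort output | uniq -d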
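
For the "more than two files" part of the question, note that `uniq -d` reports a string as soon as it shows up in any two of the files, not necessarily in all of them. One way to tighten that, as a rough sketch rather than a definitive implementation, is to count occurrences with `uniq -c` and keep only the strings whose count equals the number of files. The function name `common_strings` is made up for this illustration, and the sketch assumes a POSIX shell with `sort`, `uniq` and `awk`, and strings without spaces (as stated in the question):

```shell
# common_strings FILE...
# Hypothetical helper: prints the strings that occur in every file given.
# Assumes one string per line and no whitespace inside a string.
common_strings() {
  n=$#    # number of input files
  for f in "$@"; do
    # De-duplicate each file so a string counts at most once per file
    sort -u "$f"
  done |
    sort |       # bring identical strings from different files together
    uniq -c |    # prefix each distinct string with its occurrence count
    awk -v n="$n" '$1 == n { print $2 }'   # keep strings seen in all n files
}
```

Called as `common_strings file1 file2 file3`, it prints one common string per line. `sort` spills to temporary files when its input does not fit in memory, so this is the same external-sort idea as the answer above, only with a count instead of `uniq -d`.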
