Efficient substring search in a large text file containing 100 million strings (no duplicate strings)


Problem description

I have a large text file (1.5 GB) containing 100 million strings (no duplicates), arranged one per line. I want to build a web application in Java so that when a user enters a keyword (a substring), they get the count of all strings in the file that contain that keyword. I already know about one technique, Lucene; is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB of RAM and a dual-core CPU, and this needs to be done in Java only.

Recommended answer

Since you have more RAM than the size of the file, you might be able to store the entire data set in a structure in RAM and search it very quickly. A trie might be a good data structure to use; it offers fast prefix lookup, but it is not clear how well it performs for substring searches.
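As a rough illustration of the in-memory idea (not part of the original answer), here is a minimal Java sketch that loads every line into RAM once at startup and then answers each query with a parallel brute-force scan. The file name strings.txt and the heap-size remarks are assumptions; note that Java's per-String overhead on 100 million entries may exceed 4 GB in practice, and a trie, suffix array, or Lucene index would be needed if a linear scan cannot meet the 3-4 second target.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/**
 * Sketch of the in-memory approach: load all lines once, then answer each
 * keyword query with a parallel substring scan over the in-memory list.
 */
public class SubstringCounter {

    private final List<String> lines;

    public SubstringCounter(Path file) throws IOException {
        // Load all lines into RAM once. The 1.5 GB file fits the raw data,
        // but per-String overhead means the JVM needs a generous heap
        // (e.g. -Xmx3g or more); this is an assumption, not a measurement.
        this.lines = Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    /** Counts how many stored strings contain the given keyword. */
    public long count(String keyword) {
        // A parallel brute-force scan uses both cores; every query touches
        // every line, so latency depends on total data size, not match count.
        return lines.parallelStream()
                    .filter(line -> line.contains(keyword))
                    .count();
    }

    public static void main(String[] args) throws IOException {
        // "strings.txt" is a placeholder path for the 100-million-line file.
        SubstringCounter counter = new SubstringCounter(Path.of("strings.txt"));
        System.out.println(counter.count("keyword"));
    }
}
```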

