How to remove stop words from a large collection of files more efficiently?
Question
I have 200,000 files, and I have to process and extract tokens from each one. The total size of all the files is 1.5 GB. The code I wrote for extracting tokens from each file works well; the overall execution time is about 10 minutes.
After that, I tried to remove stop words, and performance went down badly: it now takes 25 to 30 minutes.
I'm using the stop words from the website here; there are around 571 of them. The general procedure is to read each stop word from a text file, one at a time, and compare it against every token in the file.
Here is a stub of the code:
StringBuilder sb = new StringBuilder();
boolean flag = false;
for (String s : tokens) {
    // Re-opens and scans the whole stop-word file for every single token
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true;
            break;
        }
    }
    sc.close();
    if (!flag)                      // keep only tokens that are not stop words
        sb.append(s).append("\n");
    flag = false;
}
String str = sb.toString();
The performance of the above code is at least 10 times worse than that of the code below. It takes 50 to 60 minutes to execute.
StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord"," ");
Performance is far better; this takes 20 to 25 minutes.
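A side note on the replaceAll approach: String.replaceAll treats its first argument as a regular expression with no word boundaries, so a short stop word such as "is" would also be deleted from the middle of a longer token like "this". Below is a sketch of a safer variant that compiles the pattern once and anchors it with \b; the five sample words are illustrative stand-ins for the full 571-word alternation:

import java.util.regex.Pattern;

public class RegexStopWords {
    // Compiled once and reused; the \b anchors stop "is" from matching inside "this".
    // The five words below stand in for the full stop-word list.
    private static final Pattern STOP_WORDS =
            Pattern.compile("\\b(?:the|is|at|which|on)\\b");

    public static void main(String[] args) {
        String text = "this is the text at hand";
        // Each whole-word match is replaced by a space, mirroring the question's code.
        String cleaned = STOP_WORDS.matcher(text).replaceAll(" ");
        System.out.println(cleaned);
    }
}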
Is there a better approach?
Answer
Of course this is bad. You are doing O(n·m) comparisons: every one of the n tokens is compared against each of the m stop words, and the stop-word file is re-read from disk for every token. You need to rethink your algorithm.
Read all the stop words into a HashSet<String> and then just check set.contains(word). A hash-set lookup is O(1) on average, so this will improve your performance dramatically.