How to remove stop words from a large collection of files more efficiently?

Problem description

I have 200,000 files that I need to process, extracting the tokens from each one. The files total 1.5 GB. The code I wrote to extract tokens from each file works well; overall execution time is 10 minutes.

After that, I tried to remove stop words, and performance dropped badly: it now takes 25 to 30 minutes.

I'm using a stop-word list taken from a website; it contains around 571 stop words. The general procedure is to read each stop word from a text file and compare it with each token in the file.

Here is a stub of the code:

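import java.io.File;
import java.util.Scanner;

StringBuilder sb = new StringBuilder();
boolean flag = false; // true when the current token is a stop word
for (String s : tokens) {
    // The stop-word file is re-opened and re-scanned for every token
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true;
            break;
        }
    }
    sc.close();
    if (!flag) // keep only tokens that are not stop words
        sb.append(s).append("\n");
    flag = false;
}
String str = sb.toString();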

**Ignore any errors in the snippet.

The code above is at least 10 times slower than the code below. It takes 50 to 60 minutes to execute.

StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord"," ");

Performance is much better: this takes 20 to 25 minutes.

Is there a better approach?

Recommended answer

Of course this is bad. You are doing O(n^2) comparisons: every token is compared against every stop word (more precisely, on the order of n × m comparisons for n tokens and m stop words). You need to rethink your algorithm.

Read all the stop words into a HashSet<String> once, and then just check set.contains(word) for each token. Each lookup takes constant time on average, so this will improve your performance dramatically.
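The suggestion above could look roughly like the following. This is only a minimal sketch, not the answerer's own code: the class name StopWordFilter and the helper names loadStopWords and removeStopWords are hypothetical, and it assumes the stop-word file holds whitespace-separated words and that tokens are compared case-sensitively, as in the question's stub.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class illustrating the HashSet approach.
public class StopWordFilter {

    // Read the stop-word file a single time into a set.
    // Assumption: words are separated by whitespace and/or newlines.
    static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    // Keep only the tokens that are not in the stop-word set,
    // one token per line, mirroring the question's output format.
    static String removeStopWords(Iterable<String> tokens, Set<String> stopWords) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (!stopWords.contains(token)) { // average O(1) lookup
                sb.append(token).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In-memory demo data; in real use, build the set once with
        // loadStopWords(...) and reuse it for all 200,000 files.
        Set<String> stopWords = new HashSet<>(List.of("the", "a", "of"));
        List<String> tokens = List.of("the", "size", "of", "a", "file");
        System.out.print(removeStopWords(tokens, stopWords));
    }
}
```

The key design point is that the set is built once and shared across every file, so the per-token cost no longer depends on the number of stop words or on re-reading the file from disk.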
