How to remove stop words from a large collection of files more efficiently?
Question
I have 200,000 files, and I have to process and extract tokens from each one. The total size of all the files is 1.5 GB. The code I wrote for extracting tokens from each file works well; the overall execution time is about 10 minutes.
After that, I tried to remove stop words, and performance went down badly: it now takes 25 to 30 minutes.
I'm using the stop words from the website here; there are around 571 of them. The general procedure is to read each stop word from a text file, one at a time, and compare it against every token in the file.
Here is a stub of the code:
StringBuilder sb = new StringBuilder();
boolean flag = false;
for (String s : tokens) {
    // Re-opens and scans the whole stop-word file for every single token
    Scanner sc = new Scanner(new File("stopwords.txt"));
    while (sc.hasNext()) {
        if (sc.next().equals(s)) {
            flag = true;
            break;
        }
    }
    sc.close();
    if (!flag)                      // keep only tokens that are not stop words
        sb.append(s).append("\n");
    flag = false;
}
String str = sb.toString();
The performance of the above code is at least 10 times worse than that of the code below. It takes 50 to 60 minutes to execute.
StringBuilder sb = new StringBuilder();
String s = tokens.toString();
String str = s.replaceAll("StopWord1|Stopword2|Stopword3|........|LastStopWord"," ");
Performance is far better; this takes 20 to 25 minutes.
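A side note on the replaceAll approach: String.replaceAll treats its first argument as a regular expression with no word boundaries, so a short stop word such as "is" would also be deleted from the middle of a longer token like "this". Below is a sketch of a safer variant that compiles the pattern once and anchors it with \b; the five sample words are illustrative stand-ins for the full 571-word alternation:

import java.util.regex.Pattern;

public class RegexStopWords {
    // Compiled once and reused; the \b anchors stop "is" from matching inside "this".
    // The five words below stand in for the full stop-word list.
    private static final Pattern STOP_WORDS =
            Pattern.compile("\\b(?:the|is|at|which|on)\\b");

    public static void main(String[] args) {
        String text = "this is the text at hand";
        // Each whole-word match is replaced by a space, mirroring the question's code.
        String cleaned = STOP_WORDS.matcher(text).replaceAll(" ");
        System.out.println(cleaned);
    }
}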
Is there a better approach?
Answer
Of course this is bad. You are doing O(n·m) comparisons: every one of the n tokens is compared against each of the m stop words, and the stop-word file is re-read from disk for every token. You need to rethink your algorithm.
Read all the stop words into a HashSet<String> and then just check set.contains(word). A hash-set lookup is O(1) on average, so this will improve your performance dramatically.