更快地检测字符串中的n-gram? [英] quicker way to detect n-grams in a string?

查看:120
本文介绍了更快地检测字符串中的n-gram?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在SO上发现了这个解决方案来检测字符串中的n-gram:
(这里:从句子生成N-gram

I found this solution on SO to detect n-grams in a string: (here: N-gram generation from a sentence)

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

=>这段代码到目前为止,最长的处理时间(我的语料库检测1克,2克,3克和4克为28秒:原始文本为4Mb),与其他操作的毫秒(删除停用词等)相比较。 )

=> this bit of code takes by far the longest processing time (28 seconds for the detection of 1-grams, 2-grams, 3-grams and 4grams for my corpus: 4Mb of raw text), as compared to milliseconds for other operations (removal of stopwords etc.)

有人知道Java中的解决方案会比上面提到的循环解决方案更快吗? (我在考虑使用多线程,使用集合,或者使用创造性的方法来分割字符串......?)谢谢!

Does anybody know of solutions in Java which would go faster than the looping solution presented above? (I was thinking multithreading, use of collections, or maybe creative ways to split a string...?) Thanks!

推荐答案

您可以尝试这样的事情:

You could try something like this:

public class NGram {

    private final int n;
    private final String text;

    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while(++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found<n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    private void push() {
        for (int i = 0; i < n-1; i++) {
            indexes[i] = indexes[i+1];
        }
        indexes[n-1] = index+1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    private String get() {
        return text.substring(indexes[0], index);
    }
}

对大约5mb的文本进行测试比原始代码快10倍。这里的主要区别是正则表达式不用于拆分,而ngram字符串不是通过连接创建的。

Testing on about 5mb of text it seems to perform about 10 times faster than the original code. The main difference here is that regex is not used to split and the ngram strings are not created by concatenation.

更新:
这是我得到的输出在上面提到的文本上运行,ngram 1-4。我运行2GB内存来决定运行期间对GC的影响。我多次运行以查看热点编译器的影响。

Update: This is the output I get when running on the text mentioned above, ngram 1-4. I run with 2GB of memory to decide the impact on GC during the runs. I ran multiple times to see the impact of the hotspot compiler.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

这篇关于更快地检测字符串中的n-gram?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆