检测字符串中 n-gram 的更快方法? [英] quicker way to detect n-grams in a string?

查看:19
本文介绍了检测字符串中 n-gram 的更快方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 SO 上找到了这个解决方案来检测字符串中的 n-gram:(此处:从句子生成 N-gram)

I found this solution on SO to detect n-grams in a string: (here: N-gram generation from a sentence)

import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i+n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

=> 与毫秒相比,这段代码花费的处理时间最长(检测我的语料库的 1-gram、2-gram、3-gram 和 4gram:4Mb 的原始文本需要 28 秒)用于其他操作(去除停用词等)

=> this bit of code takes by far the longest processing time (28 seconds for the detection of 1-grams, 2-grams, 3-grams and 4grams for my corpus: 4Mb of raw text), as compared to milliseconds for other operations (removal of stopwords etc.)

有人知道 Java 中的解决方案比上面介绍的循环解决方案更快吗?(我在考虑多线程、集合的使用,或者可能是拆分字符串的创造性方法......?)谢谢!

Does anybody know of solutions in Java which would go faster than the looping solution presented above? (I was thinking multithreading, use of collections, or maybe creative ways to split a string...?) Thanks!

推荐答案

你可以试试这样的:

public class NGram {

    private final int n;
    private final String text;

    private final int[] indexes;
    private int index = -1;
    private int found = 0;

    public NGram(String text, int n) {
        this.text = text;
        this.n = n;
        indexes = new int[n];
    }

    private boolean seek() {
        if (index >= text.length()) {
            return false;
        }
        push();
        while(++index < text.length()) {
            if (text.charAt(index) == ' ') {
                found++;
                if (found<n) {
                    push();
                } else {
                    return true;
                }
            }
        }
        return true;
    }

    private void push() {
        for (int i = 0; i < n-1; i++) {
            indexes[i] = indexes[i+1];
        }
        indexes[n-1] = index+1;
    }

    private List<String> list() {
        List<String> ngrams = new ArrayList<String>();
        while (seek()) {
            ngrams.add(get());
        }
        return ngrams;
    }

    private String get() {
        return text.substring(indexes[0], index);
    }
}

对大约 5mb 的文本进行测试,它的执行速度似乎比原始代码快 10 倍.这里的主要区别是正则表达式不用于拆分,ngram 字符串不是通过连接创建的.

Testing on about 5mb of text it seems to perform about 10 times faster than the original code. The main difference here is that regex is not used to split and the ngram strings are not created by concatenation.

更新:这是我在上面提到的文本上运行时得到的输出,ngram 1-4.我用 2GB 的内存运行来决定运行期间对 GC 的影响.我跑了很多次,看看热点编译器的影响.

Update: This is the output I get when running on the text mentioned above, ngram 1-4. I run with 2GB of memory to decide the impact on GC during the runs. I ran multiple times to see the impact of the hotspot compiler.

Loop 01 Code mine ngram 1 time 071ms ngrams 294121
Loop 01 Code orig ngram 1 time 534ms ngrams 294121
Loop 01 Code mine ngram 2 time 016ms ngrams 294120
Loop 01 Code orig ngram 2 time 360ms ngrams 294120
Loop 01 Code mine ngram 3 time 082ms ngrams 294119
Loop 01 Code orig ngram 3 time 319ms ngrams 294119
Loop 01 Code mine ngram 4 time 014ms ngrams 294118
Loop 01 Code orig ngram 4 time 439ms ngrams 294118

Loop 10 Code mine ngram 1 time 013ms ngrams 294121
Loop 10 Code orig ngram 1 time 268ms ngrams 294121
Loop 10 Code mine ngram 2 time 014ms ngrams 294120
Loop 10 Code orig ngram 2 time 323ms ngrams 294120
Loop 10 Code mine ngram 3 time 013ms ngrams 294119
Loop 10 Code orig ngram 3 time 412ms ngrams 294119
Loop 10 Code mine ngram 4 time 014ms ngrams 294118
Loop 10 Code orig ngram 4 time 423ms ngrams 294118

这篇关于检测字符串中 n-gram 的更快方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆