如何使正则表达式词干常用前缀? [英] How to make common prefixes for regex word stemming?

查看:90
本文介绍了如何使正则表达式词干常用前缀?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的话,我需要做数组的查找和替换正则表达式所操作的,而有时这种阵列可以是几千字的长。我测试过,发现制止使用通用前缀的话是不是单独为他们寻找快得多。也就是说, ^其中|为什么$ 慢于 ^ WH(ERE | Y)$ 。显然,这是不是在这么短的例子一个明显的区别,但它是相当快的地方有成千上万的替代品和主题串长。

I have an array of words I need to do a find-and-replace by regex operation on, and sometimes this array can be thousands of words long. I've tested and found that stemming the words using common prefixes is much faster than searching for them individually. That is, ^where|why$ is slower than ^wh(ere|y)$. Obviously it's not a noticeable difference in such a short example, but it's considerably faster where there are thousands of alternatives and the subject string is long.

所以我在寻找一个办法做到这一点自动词干,例如转换一个的String [] {是什么,为什么,哪里,何时,这} WH(在| Y | E(重| N)| I(CH))

So I'm looking for a way to do this stemming automatically, for instance to convert a string[] { "what", "why", "where", "when", "which" } into wh(at|y|e(re|n)|i(ch))

是否已有一个公认的算法,在那里,做到这一点?如果没有,你会怎么做呢?这似乎需要进行递归做,但我不能完全得到我的头一轮如何做到这一点。我有我写了一个方法,它的工作原理程度有限,但它是不雅,60线多头,使用多个嵌套的foreach循环所以这是一个未来的维护的噩梦。我敢肯定有一个更好的方式,如果任何人都可以在该会非常感激......

Is there already a recognized algorithm out there that does this ? If not, how would you go about it ? It seems to need to be done recursively but I can't quite get my head round how to do it. I have a method I wrote that works to a limited extent, but it's inelegant, 60 lines longs and uses multiple nested foreach loops so it's a future maintenance nightmare. I'm sure there's a much better way, if anyone could point me in the right direction that'd be much appreciated...

推荐答案

此代码应工作:

public static class StemmingUtilities
{
    private class Node
    {
        public char? Value { get; private set; }
        public Node Parent { get; private set; }
        public List<Node> Children { get; private set; }
        public Node(char? c, Node parent)
        {
            this.Value = c;
            this.Parent = parent;
            this.Children = new List<Node>();
        }
    }

    public static string GetRegex(IEnumerable<string> tokens)
    {
        var root = new Node(null,null);
        foreach (var token in tokens)
        {
            var current = root;
            for (int i = 0; i < token.Length; i++)
            {
                char c = token[i];
                var node = current.Children.FirstOrDefault(x => x.Value.Value == c);
                if (node == null)
                {
                    node = new Node(c,current);
                    current.Children.Add(node);
                }
                current = node;
            }   
        }
        return BuildRexp(root);
    }

    private static string BuildRexp(Node root)
    {
        string s = "";
        bool addBracket = root.Children.Count > 1;
        // uncomment the following line to avoid first brakets wrapping (occurring in case of multiple root's children)
        // addBracket = addBracket && (root.Parent != null); 
        if (addBracket)
            s += "(";
        for(int i = 0; i < root.Children.Count; i++)
        {
            var child = root.Children[i];
            s += child.Value;
            s += BuildRexp(child);
            if (i < root.Children.Count - 1)
                s += "|";
        }
        if (addBracket)
            s += ")";
        return s;
    }
}



用法:

var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"

string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"

编辑:结果
获得 REG2 =WH(Y |于| E(重| N) )| A(BC | pple)即没有第一覆盖支架,只需取消注释 BuildRexp


to get reg2 = "wh(y|at|e(re|n))|a(bc|pple)" i.e. without the first wrapping brackets, just uncomment the marked line in BuildRexp method.

这篇关于如何使正则表达式词干常用前缀?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆