Splitting a string into words in a culture-neutral way


Question

I've come up with the method below, which aims to split a text of variable length into an array of words for further full-text index processing (stop word removal, followed by a stemmer). The results seem to be OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.

P.S.: I can't use a full-blown full-text search engine like Lucene.Net for several reasons (Silverlight, overkill for the project scope, etc.).

// Requires: using System; using System.Collections.Generic; using System.Text;
public string[] SplitWords(string Text)
{
    var result = new List<string>();

    // null or empty input yields no words
    if (string.IsNullOrEmpty(Text))
        return result.ToArray();

    var sbWord = new StringBuilder();

    foreach (char c in Text)
    {
        // a separator or control char ends the current word
        if (Char.IsSeparator(c) || Char.IsControl(c))
        {
            if (sbWord.Length > 0)
            {
                result.Add(sbWord.ToString());
                sbWord.Clear();
            }
        }

        // punctuation and symbols are dropped without ending the word
        else if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
        {
            sbWord.Append(c);
        }
    }

    // flush the last word when the text does not end with a separator
    if (sbWord.Length > 0)
        result.Add(sbWord.ToString());

    return result.ToArray();
}

Solution

Since you asked for a culture-neutral way, I really doubt that a regular expression (word boundary: \b) will do. I googled a bit and found this. I hope it is useful.
I am pretty surprised that there is no built-in .NET equivalent of Java's BreakIterator...
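
For comparison, here is a minimal sketch of what the Java BreakIterator mentioned above does. It is Java, not .NET, so it only illustrates the kind of locale-aware word segmentation the question is after. The class name WordSplitter and the method splitWords are hypothetical names chosen for this example, and keeping only segments that contain a letter or digit is an assumption about what counts as a word for indexing.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSplitter {
    // Walks the word boundaries reported by BreakIterator and keeps only the
    // segments that contain at least one letter or digit, so whitespace and
    // punctuation segments are skipped.
    static List<String> splitWords(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(segment);
            }
        }
        return words;
    }
}
```

Calling WordSplitter.splitWords("Hello, world! It works.", Locale.ROOT) should return the four words Hello, world, It and works. The boundary rules come from the iterator rather than from a hand-maintained separator list, which is the property the question wants; how well it segments a particular language depends on the locale data of the Java runtime in use.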
