Splitting a string into words in a culture neutral way

Views: 160 Published: 2016/10/1 1:06:21 c# full-text-search

Problem description

I've come up with the method below, which splits a text of variable length into an array of words for further full-text index processing (stop word removal, followed by stemming). The results seem OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.

P.S.: I can't use a full-blown full-text search engine like Lucene.Net for several reasons (Silverlight, overkill for the project scope, etc.).

// Requires: using System; using System.Collections.Generic; using System.Text;
public string[] SplitWords(string text)
{
    var result = new List<string>();

    // Guard against null or empty input (the original indexed text[0] unconditionally).
    if (string.IsNullOrEmpty(text))
        return result.ToArray();

    var sbWord = new StringBuilder();

    foreach (char c in text)
    {
        // A separator or control char (tab, newline, ...) ends the current word.
        if (Char.IsSeparator(c) || Char.IsControl(c))
        {
            if (sbWord.Length > 0)
            {
                result.Add(sbWord.ToString());
                sbWord.Clear();
            }
        }
        // Punctuation and symbols are dropped without ending the word.
        else if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
        {
            sbWord.Append(c);
        }
    }

    // Flush the last word when the text does not end with a separator.
    if (sbWord.Length > 0)
        result.Add(sbWord.ToString());

    return result.ToArray();
}
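
To make the question of reliability across languages more concrete, here is a small sketch of how the Char checks used above classify a few sample characters. The class name CharClassDemo and the helper Classify are made up for illustration; only the Char.* methods come from the framework.

```csharp
using System;

public static class CharClassDemo
{
    // Hypothetical helper: reports which of the checks used in SplitWords
    // fires first for a single character.
    public static string Classify(char c)
    {
        if (Char.IsSeparator(c)) return "separator";
        if (Char.IsControl(c)) return "control";
        if (Char.IsPunctuation(c)) return "punctuation";
        if (Char.IsSymbol(c)) return "symbol";
        return "word";
    }

    public static void Main()
    {
        // Space, tab, newline, ideographic space, apostrophe, hyphen,
        // dollar sign, Latin letter with accent, CJK ideograph.
        char[] samples = { ' ', '\t', '\n', '\u3000', '\'', '-', '$', '\u00E9', '\u8A9E' };
        foreach (char c in samples)
            Console.WriteLine("U+{0:X4} -> {1}", (int)c, Classify(c));
    }
}
```

Tab and newline are control characters rather than separators, which is why the method needs both checks. Because punctuation is dropped without ending a word, "don't" would be indexed as "dont". Text in scripts that do not put spaces between words (Chinese or Japanese, for example) would come back as one long token per run, so character classes alone are unlikely to be enough there.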

Solution

Since you asked for a culture-neutral way, I really doubt that a regular expression (word boundary: \b) will do. I have googled a bit and found this. Hope it will be useful.
I am pretty surprised that there is no built-in equivalent of Java's BreakIterator...
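
For comparison, here is a minimal sketch of the regex route that relies on Unicode general categories instead of \b. It is not what the answer links to; the class name RegexWordSplitter and the choice to treat a "word" as a run of letters, combining marks and digits are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordSplitter
{
    // Assumption: a "word" is a run of letters (\p{L}), combining marks (\p{M})
    // and digits (\p{N}). Everything else acts as a delimiter.
    private static readonly Regex WordPattern = new Regex(@"[\p{L}\p{M}\p{N}]+");

    public static string[] SplitWords(string text)
    {
        var words = new List<string>();
        if (string.IsNullOrEmpty(text))
            return words.ToArray();

        // Collect every maximal run of word characters.
        foreach (Match m in WordPattern.Matches(text))
            words.Add(m.Value);

        return words.ToArray();
    }
}
```

Unlike the method in the question, this splits on punctuation instead of removing it, so "don't" becomes two tokens. It shares the same limitation for scripts written without spaces between words: a whole run of ideographs is returned as a single token, and real word segmentation there needs more than character categories.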
