Splitting a string into words in a culture-neutral way
Problem description
I've come up with the method below, which aims to split a text of variable length into an array of words for further full-text index processing (stop word removal, followed by stemming). The results seem to be OK, but I would like to hear opinions on how reliable this implementation would be against texts in different languages. Would you recommend using a regex for this instead? Please note that I've opted against using String.Split() because that would require me to pass a list of all known separators, which is exactly what I was trying to avoid when I wrote the function.
P.S.: I can't use a full-blown full-text search engine like Lucene.Net for several reasons (Silverlight, overkill for the project scope, etc.).
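// Requires: using System; using System.Collections.Generic; using System.Text;
public string[] SplitWords(string Text)
{
    var result = new List<string>();
    // Nothing to split; this also avoids indexing into an empty string below.
    if (string.IsNullOrEmpty(Text))
        return result.ToArray();

    bool inWord = !Char.IsSeparator(Text[0]) && !Char.IsControl(Text[0]);
    var sbWord = new StringBuilder();
    for (int i = 0; i < Text.Length; i++)
    {
        Char c = Text[i];
        // non separator char?
        if (!Char.IsSeparator(c) && !Char.IsControl(c))
        {
            if (!inWord)
            {
                sbWord = new StringBuilder();
                inWord = true;
            }
            // punctuation and symbols are dropped without ending the word
            if (!Char.IsPunctuation(c) && !Char.IsSymbol(c))
                sbWord.Append(c);
        }
        // it is a separator or control char
        else
        {
            if (inWord)
            {
                string word = sbWord.ToString();
                if (word.Length > 0)
                    result.Add(word);
                sbWord.Clear();
                inWord = false;
            }
        }
    }
    // flush the last word when the text does not end with a separator
    if (inWord && sbWord.Length > 0)
        result.Add(sbWord.ToString());
    return result.ToArray();
}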
Since you asked for a culture-neutral way, I really doubt that a regular expression (word boundary: \b) will do. I googled a bit and found this. I hope it is useful.
I am pretty surprised that there is no built-in equivalent of Java's BreakIterator...
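For readers who have not used it, here is roughly what the Java BreakIterator mentioned above does for word splitting. This is only a minimal sketch in Java (not C#): the class name WordSplitter and the method splitWords are hypothetical names chosen for illustration, and keeping only tokens that contain a letter or digit is an assumption about what should count as a word.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSplitter {

    // Splits text into words using the JDK's locale-aware word BreakIterator.
    // WordSplitter and splitWords are hypothetical names used for illustration.
    public static List<String> splitWords(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // The iterator also yields whitespace and punctuation tokens;
            // keep only tokens that contain at least one letter or digit.
            if (containsLetterOrDigit(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Walks the token by code point so that supplementary characters are handled.
    private static boolean containsLetterOrDigit(String token) {
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // prints [Hello, world, How, are, you]
        System.out.println(splitWords("Hello, world! How are you?", Locale.ROOT));
    }
}
```

The point of the sketch is that the boundary rules live in the iterator rather than in a hand-written list of separators. How well it segments scripts that do not put spaces between words depends on the break rules shipped with the runtime, so it is worth testing against the languages you actually need to index.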