Word splitting statistical approach
Question
I want to solve the word-splitting problem (parsing the words out of a long string that contains no spaces). For example, we want to extract the words [some, long, word] from somelongword.
We can achieve this with a dynamic-programming approach over a dictionary, but then we run into another issue: parsing ambiguity. E.g. orcore can be split as or core or as orc ore (we don't take phrase meaning or part of speech into account). So I am thinking about using a statistical or machine-learning approach.
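The dictionary-based dynamic approach, and the ambiguity it produces, can be sketched as follows (a minimal illustration; the tiny dictionary below is made up for the example):

```python
from functools import lru_cache

def all_segmentations(s, dictionary):
    """Return every way to split s into dictionary words, via memoized recursion."""
    @lru_cache(maxsize=None)
    def split(i):
        # All segmentations of the suffix s[i:].
        if i == len(s):
            return [[]]
        results = []
        for j in range(i + 1, len(s) + 1):
            word = s[i:j]
            if word in dictionary:
                for rest in split(j):
                    results.append([word] + rest)
        return results
    return split(0)

dictionary = {"or", "orc", "core", "ore", "some", "long", "word"}
print(all_segmentations("orcore", dictionary))      # [['or', 'core'], ['orc', 'ore']]
print(all_segmentations("somelongword", dictionary))  # [['some', 'long', 'word']]
```

The dictionary alone cannot choose between the two parses of orcore; that choice is exactly what a statistical model has to make.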
I found that Naive Bayes and the Viterbi algorithm, given a training set, can be used to solve this. Can you point me to some information about applying these algorithms to the word-splitting problem?
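One common statistical formulation is a unigram model searched with a Viterbi-style recurrence, in the spirit of Norvig's segmenter: pick the split that maximizes the product of word probabilities (equivalently, the sum of log-probabilities). A minimal sketch, with made-up counts (a real system would use corpus frequencies):

```python
import math
from functools import lru_cache

# Hypothetical unigram counts for illustration only.
counts = {"or": 500, "core": 300, "orc": 5, "ore": 40,
          "some": 800, "long": 600, "word": 700}
total = sum(counts.values())

def logp(word):
    # Log-probability of a word; unseen words get a heavy per-character penalty.
    if word in counts:
        return math.log(counts[word] / total)
    return -10.0 * len(word)

@lru_cache(maxsize=None)
def segment(s):
    """Best split of s: maximize the sum of word log-probabilities."""
    if not s:
        return 0.0, []
    candidates = []
    for i in range(1, len(s) + 1):
        first, rest = s[:i], s[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((logp(first) + rest_score, [first] + rest_words))
    return max(candidates)

print(segment("orcore")[1])  # ['or', 'core'] beats ['orc', 'ore'] on frequency
```

Because "or" and "core" are far more frequent than "orc" and "ore" in these counts, the ambiguity from the previous example is resolved in favor of [or, core].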
UPD: I've implemented this method in Clojure, using some advice from Peter Norvig's code.
Answer
I think the slideshow by Peter Norvig and Sebastian Thrun is a good place to start. It presents real-world work done at Google.