Word splitting statistical approach

Problem description

I want to solve the word splitting problem (parsing words out of a long string with no spaces). For example, we want to extract the words [some, long, word] from somelongword.

We can achieve this with some dynamic approach over a dictionary, but another issue we encounter is parsing ambiguity, i.e. orcore => or core or orc ore (we don't take phrase meaning or part of speech into account). So I am thinking about using some statistical or ML approach, as sketched below.
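
To make the ambiguity concrete, here is a minimal Python sketch of the dictionary-driven dynamic approach (not from the original post); the WORDS set is a toy assumption for illustration, and a real solution would load a full dictionary:

```python
# Toy dictionary for illustration only; a real system would load
# a full word list from a file.
WORDS = {"or", "core", "orc", "ore", "some", "long", "word"}

def all_splits(s):
    """Return every way to split s into dictionary words."""
    if not s:
        return [[]]  # one split of the empty string: the empty split
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in WORDS:
            for rest in all_splits(s[i:]):
                results.append([prefix] + rest)
    return results

print(all_splits("orcore"))        # [['or', 'core'], ['orc', 'ore']]
print(all_splits("somelongword"))  # [['some', 'long', 'word']]
```

Enumerating every split shows exactly where the ambiguity comes from, but it is exponential in the worst case; memoizing on the remaining suffix keeps the number of distinct subproblems linear in the string length.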

I found that Naive Bayes and the Viterbi algorithm with a training set can be used to solve this. Can you point me to some information about the application of these algorithms to the word splitting problem?
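
For context, the statistical resolution can be sketched as a Viterbi-style dynamic program over unigram probabilities, in the spirit of Peter Norvig's segmentation code: pick the split that maximizes the product of word probabilities. The COUNTS table below is made up for illustration (an assumption, not real corpus data); a real system would estimate frequencies from a large corpus:

```python
import math
from functools import lru_cache

# Made-up unigram counts for illustration; real code would use
# frequencies estimated from a large corpus.
COUNTS = {"or": 500, "core": 300, "orc": 5, "ore": 10,
          "some": 400, "long": 350, "word": 320}
TOTAL = sum(COUNTS.values())

def log_prob(word):
    # Unseen words get a small probability that shrinks with length,
    # so the model prefers splits made of known words.
    return math.log(COUNTS.get(word, 0.1 / 10 ** len(word)) / TOTAL)

@lru_cache(maxsize=None)
def segment(s):
    """Best segmentation of s under the unigram model."""
    if not s:
        return ()
    candidates = ((s[:i],) + segment(s[i:]) for i in range(1, len(s) + 1))
    return max(candidates, key=lambda ws: sum(log_prob(w) for w in ws))

print(segment("orcore"))        # ('or', 'core') beats ('orc', 'ore')
print(segment("somelongword"))  # ('some', 'long', 'word')
```

Because the score is a sum of per-word log probabilities, the best split of a suffix does not depend on how the prefix was split, so the memoized recursion is a correct dynamic program rather than a greedy heuristic.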

UPD: I've implemented this method in Clojure, using some advice from Peter Norvig's code.

Recommended answer

I think the slideshow by Peter Norvig and Sebastian Thrun is a good starting point. It presents real-world work done at Google.
