Split a sentence into separate words
Problem description
I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:
- try to find the first two characters of the sentence in the database (主楼),
- if 主楼 is actually a word and it is in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it is not in the database, which means my application now knows that 主楼 is a separate word,
- try to do the same with the rest of the characters.
I don't really like this approach, because to analyze even a small text it would query the database too many times.
Are there any other solutions?
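For illustration, here is a minimal Python sketch of the dictionary lookup described above, assuming the word list is small enough to load into memory once, so that no per-lookup database query is needed. It uses forward maximum matching, a common variant that tries the longest candidate first and then shrinks it, rather than growing the candidate from two characters. The `segment` function, the `max_word_len` parameter, and the three-word dictionary are hypothetical names and data chosen for this example.

```python
def segment(sentence, dictionary, max_word_len=4):
    """Split `sentence` into words using forward maximum matching.

    `dictionary` is any container supporting fast membership tests
    (for example a set loaded once from the database).
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink it one character at a time.
        longest = min(max_word_len, len(sentence) - i)
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # A single character is always accepted as a fallback,
            # so the loop advances even for unknown characters.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical dictionary; in practice it would be read from the database once.
dictionary = {"主楼", "怎么", "走"}
print(segment("主楼怎么走", dictionary))  # ['主楼', '怎么', '走']
```

Because every lookup is a set membership test, the database is queried only once, when the dictionary is loaded. Greedy matching of this kind can still pick the wrong split for ambiguous sentences, so treat it as a starting point rather than a complete segmenter.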
Recommended answer
Thank you all for your help!
After a little research I found some working tools (keeping all your suggestions in mind), which is why I'm answering my own question.
A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)
A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)
There are some other solutions available if you try searching baidu.com for "中文分词".
Regards,
Equ