Word break in languages without spaces between words (e.g., Asian)?

Question

I'd like to make MySQL full-text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages, and probably others, do not normally have whitespace between words. Search is not useful when you must type the same sentence as appears in the text.

I cannot just put a space between every character, because English must work too. I would like to solve this problem with PHP or MySQL.

Can I configure MySQL to recognize characters which should be their own indexing units? Is there a PHP module that can recognize these characters so I could just throw spaces around them for the index?

Update

A partial solution:

// The /u modifier makes PCRE operate on UTF-8 code points rather than
// bytes; without it, the multi-byte character range is matched byte-wise.
$string_with_spaces =
  preg_replace( "/[".json_decode('"\u4e00"')."-".json_decode('"\uface"')."]/u",
  " $0 ", $string_without_spaces );

This makes a character class out of at least some of the characters I need to treat specially. I should probably mention that it is acceptable to munge the indexed text.

Does anyone know all the ranges of characters I'd need to insert spaces around?

Also, surely there is a better, portable way to represent those characters in PHP? Literal Unicode in source code is not ideal: I will not recognize all the characters, and they may not render on all the machines I have to use.
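For what it's worth, PCRE accepts \x{...} code-point escapes when the u modifier is set, which avoids both the literal ideographs and the json_decode() trick; a minimal sketch, using the CJK Unified Ideographs block (U+4E00–U+9FFF) as an example range rather than a complete list:

    // Code-point escapes keep the source ASCII-only; /u is still required.
    $string_with_spaces = preg_replace(
        '/[\x{4E00}-\x{9FFF}]/u',
        ' $0 ', $string_without_spaces );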

Solution

Word breaking for the languages mentioned requires a linguistic approach, for example one that uses a dictionary along with an understanding of basic stemming rules.

I've heard of relatively successful full-text search applications for Chinese which simply split every single character into a separate word, applying the same "tokenization" to the search criteria supplied by the end users. The search engine then ranks higher the documents which supply the character-words in the same order as the search criteria. I'm not sure this could be extended to languages such as Japanese, as the Hiragana and Katakana character sets make the text more akin to European languages with a short alphabet.
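To make that scheme concrete, here is a minimal sketch of such a per-character "tokenization" in PHP; the tokenize_unigrams() helper is hypothetical, and \p{Han} / \p{Latin} are PCRE Unicode script classes (the /u modifier is required):

    // Each Han ideograph becomes its own token; Latin/digit runs stay whole.
    function tokenize_unigrams(string $text): array
    {
        preg_match_all('/\p{Han}|[\p{Latin}\p{N}]+/u', $text, $m);
        return $m[0];
    }

    print_r(tokenize_unigrams('MySQL全文搜索'));
    // => ['MySQL', '全', '文', '搜', '索']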

EDIT:
Resources
This word-breaking problem, as well as related issues, is so non-trivial that whole books have been written about it. See, for example, CJKV Information Processing (CJKV stands for Chinese, Japanese, Korean and Vietnamese; you may also use the CJK keyword, since in many texts Vietnamese is not discussed). See also Word Breaking in Japanese is hard for a one-pager on this topic.
Understandably, the majority of the material covering this topic is written in one of the underlying native languages and is therefore of limited use to people without relative fluency in those languages. For that reason, and also to help you validate the search engine once you start implementing the word-breaker logic, you should seek the help of a native speaker or two.

Various ideas
Your idea of identifying characters which systematically imply a word break (say quotes, parentheses, hyphen-like characters and such) is good, and that is probably one heuristic used by some professional-grade word breakers. Yet, you should seek an authoritative source for such a list rather than assembling one from scratch based on anecdotal findings.
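One concrete example of such a professional-grade breaker is ICU's word-break iterator, which recent ICU versions back with a built-in dictionary for Chinese and Japanese; PHP exposes it through the intl extension. A minimal sketch, assuming intl is installed (the exact segmentation depends on the ICU version and its dictionary):

    // IntlBreakIterator delegates to ICU; for CJK text the word instance
    // segments with a dictionary rather than looking for whitespace.
    $it = IntlBreakIterator::createWordInstance('ja');
    $it->setText('私は日本語を勉強しています');
    foreach ($it->getPartsIterator() as $token) {
        echo $token, "\n";
    }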
A related idea is to break words at Kana-to-Kanji transitions (but I'm guessing not the other way around), and possibly at Hiragana-to-Katakana or vice-versa transitions.
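A hedged sketch of that transition heuristic, using PCRE's Unicode script classes (the rule set is illustrative only, not an authoritative list of break points):

    // Insert a break wherever a kana character is followed by a kanji.
    $broken = preg_replace(
        '/([\p{Hiragana}\p{Katakana}])(?=\p{Han})/u',
        '$1 ', $text );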
Unrelated to word breaking proper, the index may [ -or may not- ;-)] benefit from the systematic conversion of every, say, hiragana character to the corresponding katakana character. Just an uneducated idea! I do not know enough about the Japanese language to know whether that would help; intuitively, it would be loosely akin to the systematic conversion of accented letters to the corresponding non-accented letters, as practiced with several European languages.
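Should you want to experiment with that folding, mbstring already implements it: mb_convert_kana() with mode 'C' maps full-width hiragana to full-width katakana. A sketch; whether the folding actually helps recall is exactly the open question above:

    // Fold hiragana into katakana so both spellings index identically,
    // loosely akin to accent-folding in European languages.
    $normalized = mb_convert_kana('にほんご ニホンゴ', 'C', 'UTF-8');
    // => 'ニホンゴ ニホンゴ'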

Maybe the idea I mentioned earlier, of systematically indexing individual characters (and of ranking the search results based on their order-wise proximity to the search criteria), can be slightly altered, for example by keeping consecutive kana characters together, plus some other rules... and produce an imperfect but practical enough search engine.
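As a sketch of that altered scheme (illustrative rules only: kana runs stay together, kanji split per character, Latin runs stay whole):

    preg_match_all(
        '/\p{Hiragana}+|\p{Katakana}+|\p{Han}|[\p{Latin}\p{N}]+/u',
        '私はカタカナを勉強する', $m );
    print_r($m[0]);
    // => ['私', 'は', 'カタカナ', 'を', '勉', '強', 'する']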

Do not be disappointed if this is not the case... As stated, this is far from trivial, and pausing to read a book or two may save you time and money in the long term. Another reason to try to learn more of the "theory" and best practices is that at the moment you seem focused on word breaking, but soon the search engine may also benefit from stemming awareness; indeed, these two issues are, linguistically at least, related, and may benefit from being handled in tandem.

Good luck on this vexing but worthy endeavor.
