MySQL full-text index: what is natural language mode?


Question

I have a question about this article: http://dev.mysql.com/doc/refman/5.6/zh-CN/fulltext-natural-language.html

There I found something like this:

SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);

What I don't understand is what exactly natural language mode is. I can't find an exact definition anywhere.

Can anyone provide a definition? How does it work?

Recommended answer

MySQL's natural language full-text searches aim to match a search query against a corpus and find the most relevant matches. So assume we have an article that contains "I love pie", and we have documents d1, d2, d3 (the rows of your table, in your case). Documents d1 and d2 are about sports and religion respectively, and document d3 is about food. A query like yours, with the article text in place of 'database',

SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE);

will return d3 first, because d3 matches the article best. d2 and d1 follow only if they share at least one indexed word with the article, in an order that depends on which of them is more similar to it; rows with zero relevance are not returned at all.

The underlying ranking algorithm MySQL uses is most likely a variant of tf-idf, where tf stands for term frequency and idf for inverse document frequency. tf is just what it says: the number of times a word w from the article occurs in a given document. idf is based on how many documents the word occurs in: the more documents contain it, the lower its idf, so words that occur in many documents contribute little to deciding the most representative document. The product tf*idf gives a score; the higher the score, the better the word represents the document. So 'pie', which occurs only in document d3, has a nonzero tf there and a high idf (since it's the inverse), and therefore a high score. 'the', by contrast, has a high tf but a low idf, which evens out the tf and gives a low score.
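To make the tf*idf reasoning concrete, here is a minimal Python sketch of the textbook formula. It is not MySQL's actual implementation, whose exact weighting differs by storage engine and version. The three-document corpus and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus standing in for documents d1, d2, d3.
docs = {
    "d1": "the team won the match",
    "d2": "the church held the service",
    "d3": "the baker sold the pie and the pie was good",
}


def tf_idf(term, doc_id, docs):
    """Textbook tf-idf score of `term` for one document (illustrative only)."""
    tokens = {d: text.split() for d, text in docs.items()}
    # tf: how many times the term occurs in this document.
    tf = Counter(tokens[doc_id])[term]
    # df: how many documents contain the term at least once.
    df = sum(1 for words in tokens.values() if term in words)
    if df == 0:
        return 0.0
    # idf: the rarer the term across documents, the larger the value.
    idf = math.log10(len(docs) / df)
    return tf * idf


print(tf_idf("pie", "d3", docs))  # positive: 'pie' occurs only in d3
print(tf_idf("the", "d3", docs))  # zero: 'the' occurs in every document
```

In this sketch 'the' scores exactly zero because it appears in all three documents, so its idf is log10(3/3) = 0 no matter how often it occurs.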

MySQL's natural language mode also ignores a set of stopwords (the, a, some, etc.) and words shorter than a minimum length (four characters by default for MyISAM, three for InnoDB). This is described in the page you linked:

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. You can control the cutoff by setting a configuration option before creating the index: innodb_ft_min_token_size configuration option for InnoDB search indexes, or ft_min_word_len for MyISAM.

Words in the stopword list are ignored. A stopword is a word such as "the" or "some" that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list. The stopword lists and related configuration options are different for InnoDB search indexes and MyISAM ones. Stopword processing is controlled by the configuration options innodb_ft_enable_stopword, innodb_ft_server_stopword_table, and innodb_ft_user_stopword_table for InnoDB search indexes, and ft_stopword_file for MyISAM ones.
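The effect of the two rules quoted above can be sketched in a few lines of Python. This is only a rough model: the stopword set is a tiny illustrative subset taken from the examples in the text, the regular-expression tokenizer is a simplification of what the server does, and `indexable_terms` is a hypothetical helper name.

```python
import re

# Illustrative subset only; the real built-in lists are longer
# and differ between InnoDB and MyISAM.
STOPWORDS = {"the", "a", "some"}


def indexable_terms(text, min_len, stopwords=STOPWORDS):
    """Return the words that survive the length cutoff and stopword filter."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stopwords]


# Default cutoffs from the quoted documentation: 3 for InnoDB, 4 for MyISAM.
print(indexable_terms("I love the pie", min_len=3))  # ['love', 'pie']
print(indexable_terms("I love the pie", min_len=4))  # ['love']
```

Under this model, a search for "pie" would find nothing with the MyISAM default cutoff of four characters, which is a common reason a short search term appears to match no rows.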
