mysql - fulltext index - what is natural language mode


I have a question regarding this article: http://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html.

Here I found queries like

SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);

What I don't understand is: what exactly is natural language mode? I can't find an exact definition anywhere.

Can anyone provide a definition? How does it work?

Solution

MySQL's natural language full-text search matches a search string against a corpus (the rows of the indexed table) and ranks the rows by relevance. Suppose the search string is "I love pie" and the table holds three documents, d1, d2, and d3. Documents d1 and d2 are about sports and religion respectively, and d3 is about food. With that search string in place of 'database', your query,

SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE);

returns d3 first, because d3 matches the search string best. d2 and d1 follow in whatever order their relevance dictates, and only if they share at least one indexed word with the search string: a row with zero relevance is not returned at all.
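To see the relevance value MySQL assigns to each row, MATCH() can also be placed in the select list. This is a minimal sketch, assuming the articles table from the question has an id column and a FULLTEXT index on (title, body); the alias score is just an illustrative name.

```sql
-- Assumes: articles(id, title, body) with a FULLTEXT index on (title, body).
-- The same MATCH ... AGAINST expression is used twice: once to expose the
-- relevance value, once to filter out rows that do not match.
SELECT id,
       title,
       MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE) AS score
FROM articles
WHERE MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

The score is a nonnegative floating-point number, and larger values mean a closer match.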

The underlying algorithm MySQL uses is probably a form of tf-idf, where tf stands for term frequency and idf for inverse document frequency. tf is just what it says: the number of times a word w from the search string occurs in a given document. idf is based on how many documents the word occurs in: the more documents contain it, the lower its idf, so words that occur in many documents contribute little to deciding which document is most relevant. The product tf*idf gives a score; the higher the score, the better the word represents the document. So 'pie', which occurs only in document d3, has a high tf there and a high idf (it is the inverse of a low document count). 'the', by contrast, has a high tf but a low idf, which evens out the tf and gives a low score.
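As a worked example of the arithmetic, assume the textbook form score = tf * log10(N / df), where N is the number of documents and df is the number of documents containing the word. The counts below are made up for illustration, and MySQL's actual weighting may differ in detail.

```sql
-- Textbook tf-idf for the three-document example (N = 3).
-- Assumed counts, for illustration only:
--   'pie' occurs 2 times in d3 and appears in 1 document overall;
--   'the' occurs 5 times in d3 and appears in all 3 documents.
SELECT 2 * LOG10(3 / 1) AS pie_score,  -- roughly 0.95
       5 * LOG10(3 / 3) AS the_score;  -- 0, because LOG10(1) = 0
```

Even though 'the' occurs more often than 'pie', its idf of zero wipes out its contribution to the score.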

MySQL's natural language mode also ignores stopwords (the, a, some, etc.) and words shorter than a minimum length, which by default is three characters for InnoDB and four for MyISAM. The page you linked describes this:

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. You can control the cutoff by setting a configuration option before creating the index: innodb_ft_min_token_size configuration option for InnoDB search indexes, or ft_min_word_len for MyISAM.

Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list. The stopword lists and related configuration options are different for InnoDB search indexes and MyISAM ones. Stopword processing is controlled by the configuration options innodb_ft_enable_stopword, innodb_ft_server_stopword_table, and innodb_ft_user_stopword_table for InnoDB search indexes, and ft_stopword_file for MyISAM ones.
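To check which of these settings apply on a given server, the variables and the built-in InnoDB stopword list can be inspected directly. This is a sketch that assumes a MySQL 5.6 or later server and an account allowed to read these values.

```sql
-- Minimum indexed word length for InnoDB and for MyISAM full-text indexes.
SHOW VARIABLES LIKE 'innodb_ft_min_token_size';
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Whether InnoDB stopword filtering is enabled.
SHOW VARIABLES LIKE 'innodb_ft_enable_stopword';

-- The built-in InnoDB stopword list.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
```

As the quoted documentation notes, the minimum length has to be set before the index is created, so an existing FULLTEXT index needs to be rebuilt after changing it.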
