如何优化 MySQL Boolean Full-Text Search?(或者用什么代替它?) - C# [英] How to optimize MySQL Boolean Full-Text Search? (Or what to replace it with?) - C#

查看:16
本文介绍了如何优化 MySQL Boolean Full-Text Search?(或者用什么代替它?) - C#的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含大约 22000 行的表格,我使用了布尔全文搜索来找到我感兴趣的内容.我的问题是我创建了一个由 DataGridView 在每个 TextChanged 事件后刷新.您可能已经发现,在每个事件之后搜索插入的字符串需要花费大量时间.

我可以做些什么来提高搜索速度?

欢迎提出任何建议!

解决方案

首先,您应该意识到 RDBMS 对全文索引的支持是一种黑客技术,可以强制使用旨在允许有效访问结构化数据的技术来处理非结构化文本.(是的,这只是我的意见.如果需要,我可以为它辩护,因为我对这两种技术都非常了解.;)

那么,可以采取哪些措施来提高搜索性能?

选项一 - 完成任务的最佳工具"

在文档语料库中处理全文搜索的最佳方法是使用专门为此设计的技术,例如 SOLR (Lucene) 来自 Apache 或 Sphinx 来自 err, Sphinx.

出于下面将清楚说明的原因,我强烈推荐这种方法.

选项二 - 预加载您的结果

在构建基于文本的搜索解决方案时,通常的方法是将所有文档索引到一个可搜索的索引中,虽然这可能是最方便的方法,但并不是唯一的方法.

假设您要搜索的内容可以轻松量化为一组已知规则,那么您可以提供更多引导式"搜索方式,而不是简单的不合格全文.我的意思是,如果您的应用程序可以从引导用户获得结果中受益,您可以根据一组已知的规则将各种结果集预加载到他们自己的表中,从而减少要搜索的大量数据.

如果您希望您的大多数用户能够以已知顺序从一组已知的搜索词中受益,您可以构建您的搜索 UI 以支持这些词.

因此,假设大多数用户都在寻找各种汽车,您可能会根据车型、年份、状况等提供预定义的搜索.您的搜索 UI 将被制作为一系列下拉菜单,以引导"用户到具体结果.

或者,如果大多数搜索将针对特定的主要主题(例如汽车"),您可以预先定义一个表格,其中仅包含您之前确定为与汽车相关的那些记录.

这两种方法都会减少要搜索的记录数量,从而增加响应时间.

选项三 - 自己动手"

如果您无法将外部搜索技术集成到您的项目中并且预加载不是一种选择,那么仍有一些方法可以极大地缩短搜索查询响应时间,但它们会因您需要完成的内容以及您期望搜索的方式而有所不同进行.

如果您希望用户使用单个关键字或短语以及它们之间的布尔关系进行搜索,您可以考虑构建自己的 '倒排索引' 的语料库.(这是 MySQL 的 Boolean Full-Text Search 已经做到的,但自己动手可以更好地控制搜索的速度和准确性.)

从现有数据构建倒排索引:

第一步,创建三个表

<前>//dict - 语料库中每个唯一单词包含一行的字典创建表字典(id int 主键,单词 varchar)//invert - 一个inverted_index,用于将单词映射到语料库中的记录创建表反转(id int 主键,rec_id 整数,word_id 整数)//停用词 - 包含索引时要忽略的词(如 a、an、the 等)创建表停用词(id int 主键,单词 varchar)

注意:这只是一个草图.在实际创建这些表时,您需要添加索引和约束等.

停用词表用于将索引的大小减少到仅那些对用户预期查询重要的词.例如,索引英语文章很少有用,如a"、an"、the",因为它们对关键字搜索没有有用的意义.

通常,您需要一个专门制作以满足您的应用程序需要的停用词列表.如果您从不希望用户在他们的查询中包含红色"、白色"或蓝色"这些词,或者如果这些词出现在每条可搜索记录中,您可能希望将它们添加到您的停用词列表中.

有关在 MySQL 中使用您自己的停用词列表的说明,请参阅本消息末尾的注释.

另见:

步骤 2. 构建倒排索引

要从现有记录构建倒排索引,您需要(伪代码):

<前>foreach( word(w) in record(r) ) {if(w 不在停用词中) {if( w 在字典中不存在) {在 w.id 处将 w 插入到字典中}插入(r.id,w.id)到inverted_index}}

更多关于停用词:

不是使用特定的停用词列表,if(w is not in stopwords)"测试可以做出其他决定,可以代替或作为您不可接受的词列表的附属物.

您的应用程序可能希望过滤掉所有少于 4 个字符的单词,或者只从预定义的集合中包括 个单词.

通过创建自己的倒排索引,您可以获得对搜索的更大更细粒度的控制.

步骤 3. 使用 SQL 查询倒排索引

这一步实际上取决于您期望查询提交到索引的方式.

如果查询是硬编码",您可以简单地自己创建选择语句,或者如果您需要支持用户输入的查询,您需要将您选择的任何查询语言转换为 SQL 语句(通常使用简单的解析器完成).

假设您希望检索与逻辑查询(word1 AND word2) OR word3"匹配的所有文档,可能的方法是:

CREATE TEMPORARY TABLE temp_results ( rec_id int, count int ) AS( SELECT rec_id, COUNT(rec_id) AS 计数FROM invert AS I, dict AS DWHERE I.word_id=D.id AND (D.word='word1' OR D.word='word2')GROUP BY I.rec_id有计数= 2)联合 (SELECT rec_id, 1 AS 计数FROM invert AS I, dict AS DWHERE I.word_id=D.id AND D.word='word3');SELECT DISTINCT rec_id FROM temp_results;删除表 temp_results;

注意:这只是我头顶上的第一次传球.我相信有更有效的方法可以将布尔查询表达式转换为高效的 SQL 语句,并欢迎任何和所有改进建议.

要搜索词组,您需要向倒排索引添加一个字段,以表示该词在其记录中出现的位置,并将其纳入您的 SELECT.

最后,您需要在添加新记录或删除旧记录时更新倒排索引.

最后一句话

全文搜索"属于一个非常大的研究领域,称为信息检索"或 IR,并且有很多关于该主题的书籍,包括

查看亚马逊了解更多信息.

注意事项

如何在 MySQL 中使用自己的停用词列表

在 MySQL 中使用您自己的停用词列表:

  1. 创建您自己的停用词列表,每行一个词,并将其保存到服务器上的已知位置,例如:/usr/local/lib/IR/stopwords.txt

  2. 编辑 my.cnf 以添加或更新以下行:

    [mysqld]ft_min_word_len=1ft_max_word_len=40ft_stopword_file=/usr/local/lib/IR/stopwords.txt

    将合法单词的最小和最大长度设置为 1 和 40,分别告诉 mysqld 在哪里可以找到自定义的停用词列表.

    (注意:默认的 ft_max_word_len 是 84,我认为这太高了并且可能导致非真实单词的字符串被索引.)

  3. 重启mysqld

  4. 删除并重新创建所有与全文相关的索引

I have a table that contains approximately 22000 rows and I used a Boolean Full-Text Search in order to find what I`m interested in. My problem is that I created a 'dynamic search feeling' that consists of a DataGridView that it is refreshed after every TextChanged event. As you might have figured out it takes a lot of time to search for the inserted string after every event.

What could I do in order to improve the search speed?

Any suggestions are welcomed!

解决方案

First, you should realize that RDBMS support for full text indexing is a hack to force a technology designed to allow efficient access to structured data to deal with unstructured text. (Yes, that's just my opinion. If required, I can defend it as I understand both technologies extremely well. ;)

So, what can be done to improve search performance?

Option One - "The Best Tool For The Task"

The best way to handle full-text search within a corpus of documents is the use technology specifically designed to do so, such as SOLR (Lucene) from Apache or Sphinx from err, Sphinx.

For reasons that will become clear below, I strongly recommend this approach.

Option Two - Preload Your Results

When constructing text-based search solutions, the usual approach is to index all documents into a single searchable index and while this might be the most expedient, it is not the only approach.

Assuming what you're searching for can be easily quantified into a set of known rules, you could offer more of a "guided" style of search than simply unqualified full-text. What I mean by this is, if your application might benefit from guilding users to results, you can preload various sets of results based on a known set of rules into their own tables, and thus reduce the bulk of data to be searched.

If you expect a majority of your users will benefit from a known set of search terms in a known order, you can construct your search UI to favor those terms.

So assuming a majority of users are looking for a variety of automobile, you might offer predefined searches based on model, year, condition, etc. Your search UI would be crafted as a series of dropdown menus to "guide" users to specific results.

Or if a majority of searches will be for a specific main topic (say 'automobiles') you could predefine a table of only those records you've previously identified as being related to automobiles.

Both of these approaches would reduce the number of records to be searched and so, increase response times.

Option Three - "Roll Your Own"

If you cannot integrate an external search technology into your project and preloading isn't an option, there are still ways to vastly improve search query response times, but they differ based on what you need to accomplish and how you expect searches to be carried out.

If you expect users to search using single keywords or phrases and boolean relationships between them, you might consider constructing your own 'inverted index' of your corpus. (This is what MySQL's Boolean Full-Text Search already does, but doing it yourself allows greater control over both the speed and accuracy of search.)

To build an inverted index from your existing data:

Step 1. Create three tables

    // dict - a dictionary containing one row per unique word in corpus  
    create table dict (    
      id int primary key,  
      word varchar  
    )

    // invert - an inverted_index to map words to records in corpus  
    create table invert (    
      id int primary key,  
      rec_id int,  
      word_id int  
    )

    // stopwords - to contain words to ignore when indexing (like a, an, the, etc)
    create table stopwords ( 
      id int primary key,  
      word varchar  
    )

Note: This is just a sketch. You'll want to add indexes and constraints, etc. when you actually create these tables.

The stopwords table is used to reduce the size of your index to only those words that matter to users' expected queries. For example, it's rarely useful to index English articles, like 'a', 'an', 'the', since they do not contribute useful meaning to keyword searches.

Typically, you'll require a stopword list specifically crafted to the needs of your application. If you never expect users to include the terms 'red', 'white' or 'blue' in their queries or if these terms appear in every searchable record, you would want to add them to your stopword list.

See the note at the end of this message for instructions on using your own stopwords list in MySQL.

See also:

Step 2. Build the Inverted Index

To build an inverted index from your existing records, you'll need to (pseudo-code):

    foreach( word(w) in record(r) ) {
      if(w is not in stopwords) {
        if( w does not exist in dictionary) {
          insert w to dictionary at w.id
        }
        insert (r.id, w.id) into inverted_index
      }
    }

More on stopwords:

nstead of using a specific stopword list, the 'if(w is not in stopwords)' test could make other decisions either instead of or as an adjunct to your list of unacceptable words.

Your application might wish to filter out all words less than 4 characters long or to only include words from a predefined set.

By creating your own inverted index, you gain far greater and finer-grained control over search.

Step 3. Query the Inverted Index Using SQL

This step really depends on how you expect queries to submitted to your index.

If queries are to be 'hard-coded', you can simply create the select statement yourself or if you need to support user-entered queries, you'll need to convert whatever query language you choose into an SQL statement (typically done using a simple parser).

Assuming you wish to retrieve all documents matching the logical query '(word1 AND word2) OR word3', a possible approach might be:

CREATE TEMPORARY TABLE temp_results ( rec_id int, count int ) AS 
    ( SELECT rec_id, COUNT(rec_id) AS count 
      FROM invert AS I, dict AS D 
      WHERE I.word_id=D.id AND (D.word='word1' OR D.word='word2') 
      GROUP BY I.rec_id 
      HAVING count=2
    ) 
    UNION (
      SELECT rec_id, 1 AS count 
      FROM invert AS I, dict AS D
      WHERE I.word_id=D.id AND D.word='word3'
    );

SELECT DISTINCT rec_id FROM temp_results;

DROP TABLE temp_results;

NOTE: This is just a first pass off the top of my head. I am confident there are more efficient ways of converting a boolean query expression into an efficient SQL statement and welcome any and all suggestions for improvement.

To search for phrases, you'll need to add a field to the inverted index to represent the position the word appeared within its record and factor that into your SELECT.

And finally, you'll need to update your inverted index as you add new records or delete old ones.

Final Word

"Full text search" falls under a very large area of research known as "Information Retrieval" or IR and there are many books on the subject, including

Check Amazon for more.

Notes

How To Use Your Own List of Stopwords in MySQL

To use your own stopword list in MySQL:

  1. Create your own list of stopwords, one word per line, and save it to a known location on your server, say: /usr/local/lib/IR/stopwords.txt

  2. Edit my.cnf to add or update the following lines:

        [mysqld]  
        ft_min_word_len=1    
        ft_max_word_len=40  
        ft_stopword_file=/usr/local/lib/IR/stopwords.txt
    

    which will set the minimum and maximum length of legal words to 1 and 40, respectively, and tell mysqld where to find your custom list of stopwords.

    (Note: the default ft_max_word_len is 84, which I believe is pretty excessive and can cause runs of strings that are not real words to be indexed.)

  3. Restart mysqld

  4. Drop and recreate all full-text related indices

这篇关于如何优化 MySQL Boolean Full-Text Search?(或者用什么代替它?) - C#的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆