Getting top n latest entries from SQL Server full text index


Problem description


    I have a table in a SQL Server 2008 R2 database

    Article (Id, art_text)
    

    Id is the primary key. art_text has a full text index.

    I search for latest articles that contain the word 'house' like this:

    SELECT TOP 100 Id, art_text 
    FROM Article
    WHERE CONTAINS(art_text, 'house')
    ORDER BY Id DESC
    

    This returns the correct results but it is slow (~5 seconds). The table has 20 million rows and 350,000 of those contain the word house. I can see in the query plan that an index scan is performed in the clustered index for the 350,000 Ids returned by the full text index.

    The query could be much faster if there were a way to get only the latest 100 entries in the full text index that contain the word 'house'. Is there any way to make the query faster?

    Solution

    The short answer is yes, there are ways to make this particular query run faster, but with a corpus of 20 million rows, 5 seconds isn't bad. You'll need to seriously consider whether the suggestions below are optimal for your FT search workload and weigh the costs against the benefits. If you blindly implement these, you're going to have a bad time.


    General Suggestions for Improving Sql Server Full-text Search Performance

    Reduce the size of the full-text index being searched. The smaller the FT index, the faster the query. There are a few ways to reduce the FT index size. The first two may or may not apply, and the third would take considerable work to accomplish.

    1. Add domain-specific noise words. Noise words (called stopwords in SQL Server 2008 and later) are words that don't add value to full-text search queries, such as "the", "and", "in", etc. If there are terms related to the business that add no value when indexed, you may benefit from excluding them from the FT index. Consider a hypothetical full-text index on the MSDN library. Terms such as "Microsoft", "library", "include", "dll" and "reference" may not add value to search results. (Is there any real value in going to http://msdn.microsoft.com and searching for "microsoft"?) An FT index of legal opinions might exclude words such as "defendant", "prosecution" and "legal".

    2. Strip out extraneous data using iFilters. Full-text search uses Windows iFilters to extract text from binary documents. This is the same technology that Windows Search uses to search PDF and PowerPoint documents. The one case where this is particularly useful is when you have a description column that can contain HTML markup. By default, SQL Server full-text search will index everything, so you get terms such as "font-family", "Arial" and "href" as searchable terms. Using the HTML iFilter can strip out the markup.

      The two requirements for using an iFilter in your FT index are that the indexed column is a VARBINARY and that there is a "type" column that contains the file extension. Both of these can be accomplished with computed columns.

      CREATE TABLE t (
      ....
      description varchar(max),  -- assumed to be a text column holding the HTML
      FTS_description as (CAST(description as VARBINARY(MAX))),
      FTS_filetype as ( N'.html' )
      )
      -- Then create the fulltext index on FTS_description specifying the filetype.
      

    3. Index portions of the table and stitch together results. There are several ways to accomplish this, but the overall idea is to split the table into smaller chunks, query the chunks individually and combine the results. For example, you could create two indexed views, one for the current year and one for historical years, with full-text indexes on them. Your query to return 100 rows changes to look like this:

      DECLARE @rows int
      DECLARE @ids table (id int not null primary key)
      
      INSERT INTO @ids (id)   
          SELECT TOP (100) id 
          FROM vw_2013_FTDocuments WHERE CONTAINS (....) 
          ORDER BY Id DESC 
      SET @rows = @@rowcount
      IF @rows < 100
      BEGIN
        DECLARE @rowsLeft int
        SET @rowsLeft = 100 - @rows
        INSERT INTO @ids (id) SELECT TOP (@rowsLeft) ......
        --Logic to incorporate the historic data
      END
      SELECT ... FROM t INNER JOIN @ids .....
      

      This can result in a substantial reduction in query times at the cost of adding complexity to the search logic. This approach is also applicable when searches are typically limited to a subset of the data. For example, Craigslist might have one FT index for "Housing", one for "For Sale" and one for "Employment". Any search done from the home page would be stitched together from the individual indexes, while the common case of a search within a category is more efficient.
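
The first two suggestions above can be sketched in T-SQL roughly as follows. This is a minimal sketch, not a tested recipe: it assumes SQL Server 2008 or later (where noise words are managed as stoplists), an existing default full-text catalog, and the table t with the FTS_description and FTS_filetype columns from the example above. The stoplist name ArticleStoplist, the key index name PK_t and the sample stopwords are hypothetical.

```sql
-- 1. Domain-specific stopwords: copy the system stoplist and add terms
--    that carry no value for this corpus (sample terms only).
--    1033 is the LCID for English (United States).
CREATE FULLTEXT STOPLIST ArticleStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST ArticleStoplist ADD 'microsoft' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST ArticleStoplist ADD 'library' LANGUAGE 1033;

-- 2. iFilter: index the VARBINARY computed column and name the column
--    that holds the file extension, so the HTML filter can strip markup.
--    PK_t is a hypothetical name for the unique, non-nullable,
--    single-column index on t.
CREATE FULLTEXT INDEX ON t
    (FTS_description TYPE COLUMN FTS_filetype LANGUAGE 1033)
    KEY INDEX PK_t
    WITH STOPLIST = ArticleStoplist;
```

If the full-text index already exists, the stoplist can instead be attached with ALTER FULLTEXT INDEX ON t SET STOPLIST ArticleStoplist, which triggers a repopulation of the index.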

    Unsupported technique that will probably break in a future version of Sql Server.

    You'll need to test extensively with data of the same quantity and quality as production. If the behavior changes in future versions of SQL Server, you will have no right to complain. This is based on observation, not proof. Use at your own RISK!!

    A bit of full-text history. In SQL Server 2005, the full-text search functionality ran in a process external to sqlservr.exe. FTS was incorporated into query plans as a black box: SQL Server would pass FTS a query, and FTS would return a stream of ids. This limited the plans available to SQL Server to those where the FTS operator could basically be treated as a table scan.

    In SQL Server 2008, FTS was integrated into the engine, which improved performance. It also gave the optimizer new options for FTS query plans. Specifically, it now has the option to probe into the FTS index inside a LOOP JOIN operator to check if individual rows match the FTS predicate. (See http://sqlblog.com/blogs/joe_chang/archive/2012/02/19/query-optimizer-gone-wild-full-text.aspx for an excellent discussion of this and of the ways things can go wrong.)

    Requirements for our optimal FTS query plan. There are two characteristics to strive for to get the optimal query plan.

    1. No Sort Operations. Sorting is slow, and we don't want to sort either 20 million rows or 350,000 rows.
    2. Don't return all 350k rows matching the FTS predicate. We need to avoid this if at all possible.

    These two criteria eliminate any plan with a hash join, as a hash join requires consuming all of one input to build the hash table.

    For plans with a loop join, there are two options. The first is to scan the clustered index backwards and, for each row, probe into the full-text search engine to see if that particular row matches. In theory, this seems like a good solution, as once we match 100 rows, we're done. We may have to try 10,000 ids to find the 100 that match, but that may be better than reading all 350k. It could also be worse (see the link to Joe Chang's blog above): if each probe is expensive, our 10k probes could take substantially longer than just reading all 350k rows.

    The other loop join option is to have the FTS portion on the outer side of the loop, and seek into the clustered index. Unfortunately, the FTS engine doesn't like to return results in reverse order, so we'd have to read all 350k, and then sort them to return the top 100.

    The roadblock is getting the FTS engine to return rows in reverse order. If we can overcome this, then we can reduce the IO to reading only the last 100 rows that match. Fortunately, the FTS engine has a tendency to return rows in order by the key of the unique index specified when the index was created. (This is a natural side effect of the internal storage the FTS engine uses.)

    By adding a computed column that is the negative of the id, and specifying a unique index on that column as the key index when creating the FT index, we get really close.

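    -- NOT NULL is added to the computed column because the full-text key index column must be non-nullable
    CREATE TABLE t (id int not null primary key, txt varchar(max), neg_id as (-id) persisted not null)
    CREATE UNIQUE INDEX IX_t_neg_id on t (neg_id)
    CREATE FULLTEXT INDEX on t ( txt ) KEY INDEX IX_t_neg_id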
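
Before testing queries against a freshly created index, it may help to confirm that the initial population has finished and that the index is keyed on the intended unique index, since queries run during population can return incomplete results. A minimal sketch, assuming the table t and the index IX_t_neg_id from the statements above:

```sql
-- 0 means the full-text population is idle (finished); a non-zero value
-- means a population is still in progress, paused, or has failed.
SELECT OBJECTPROPERTYEX(OBJECT_ID('t'), 'TableFulltextPopulateStatus') AS populate_status;

-- Show which unique index the full-text index is keyed on.
-- For the technique described here it should be IX_t_neg_id.
SELECT i.name AS key_index_name
FROM sys.fulltext_indexes AS fi
JOIN sys.indexes AS i
  ON i.object_id = fi.object_id
 AND i.index_id = fi.unique_index_id
WHERE fi.object_id = OBJECT_ID('t');
```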
    

    Now for our query, we'll use CONTAINSTABLE, and some LEFT-join trickery to ensure that the FTS predicate doesn't end up on the inside of a LOOP JOIN.

    SELECT TOP (100) t.id, t.txt 
    FROM CONTAINSTABLE(t, txt, 'house') ft 
    LEFT JOIN t on ft.[Key] = t.neg_id
    ORDER BY ft.[Key]
    

    The resulting plan should be a loop join that reads only the last 100 rows from the FT index.
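
Because this relies on unsupported behavior, it is worth measuring rather than assuming. One way to compare the two approaches is to run both queries with I/O and timing statistics enabled and inspect the actual execution plans. This is only a sketch, assuming the table t defined above with production-like data:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Baseline: the straightforward query, sorted by id descending.
SELECT TOP (100) id, txt
FROM t
WHERE CONTAINS(txt, 'house')
ORDER BY id DESC;

-- Rewritten: relies on the FT engine returning keys in neg_id order,
-- so that the TOP (100) can stop after the first 100 matches.
SELECT TOP (100) t.id, t.txt
FROM CONTAINSTABLE(t, txt, 'house') ft
LEFT JOIN t ON ft.[KEY] = t.neg_id
ORDER BY ft.[KEY];

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the plan for the rewritten query contains a Sort operator, or its logical reads are not clearly lower than the baseline, the trick is not taking effect for that search term.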

    Small gusts of wind that could blow down this house of cards:

    • Complex FTS queries (as in multiple terms or the use of NOT or OR operators) can cause SQL Server 2008+ to get "smart" and translate the logic into multiple FTS queries that are joined in the query plan.
    • Any Cumulative Update, Service Pack or major version upgrade could render this approach useless.
    • It may work in 95% of the cases and time out in the remaining 5%.
    • It may not work at all for you.

    Good Luck!
