SQL Server - worth indexing large string keys?

Question

I have a table with a large string key (varchar(1024)) that I am thinking of indexing in SQL Server. I want to be able to search on it quickly, but insert performance is also important. In SQL Server 2008 I don't get a warning for this, but SQL Server 2005 tells me that the key exceeds 900 bytes and that inserts/updates with the column over this size will fail (or something to that effect).

What are my alternatives if I want to index this large column? I don't know whether it would be worth it even if I could.

Recommended answer

An index whose keys are all near 900 bytes would be very large and very deep (very few keys per page results in very tall B-trees).
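
To make the "few keys per page" point concrete, here is a back-of-the-envelope sketch in T-SQL. It is only an estimate: the row count and the per-entry overhead are made-up illustrative values, and the formula ignores fill factor, variable-length details, and the difference between leaf and non-leaf pages.

```sql
-- Rough estimate only; not an exact model of SQL Server's page layout.
declare @rows float, @page_bytes float, @overhead float;
set @rows = 10000000;     -- hypothetical row count
set @page_bytes = 8096;   -- 8 KB page minus the 96-byte page header
set @overhead = 16;       -- ASSUMED per-entry overhead in bytes, illustrative only

select
   -- index entries that fit on one page for a 900-byte key vs. a 4-byte key
   floor(@page_bytes / (900 + @overhead)) as wide_keys_per_page,
   floor(@page_bytes / (4 + @overhead))   as narrow_keys_per_page,
   -- approximate number of B-tree levels: log base fan-out of the row count
   ceiling(log(@rows) / log(floor(@page_bytes / (900 + @overhead)))) as wide_levels,
   ceiling(log(@rows) / log(floor(@page_bytes / (4 + @overhead))))   as narrow_levels;
```

Under these assumptions the wide key yields a fan-out in the single digits, so the tree needs noticeably more levels than one built on a narrow key.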

It depends on how you plan to query the values. An index is useful in several cases:

  • when a value is probed. This is the most typical use: an exact value is searched for in the table. Typical examples are WHERE column = 'ABC' or a join condition ON a.column = b.someothercolumn.
  • when a range is scanned. This is also fairly typical: a range of values is searched for in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other, less obvious examples, like a prefix match: WHERE column LIKE 'ABC%'.
  • an ordering requirement. This use is less well known, but an index can help a query with an explicit ORDER BY column requirement avoid a stop-and-go sort, and it can also help with certain hidden sort requirements, like ROW_NUMBER() OVER (ORDER BY column).
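
The three cases can be sketched as follows. The table dbo.products, its code column, and the index name are hypothetical, chosen only for illustration, and whether the optimizer actually picks the index depends on statistics and on the rest of the query.

```sql
-- Hypothetical table with a key narrow enough to index directly.
create table dbo.products (
   id int not null identity(1,1) primary key,
   code varchar(100) not null);
create index ix_products_code on dbo.products (code);
go

-- 1. Probe: exact-value search, can be answered with an index seek.
select id from dbo.products where code = 'ABC';

-- 2. Range scan: both the explicit range and the prefix match
--    can seek to the start of the range and scan forward.
select id from dbo.products where code between 'ABC' and 'DEF';
select id from dbo.products where code like 'ABC%';

-- 3. Ordering: the index already stores code in sorted order,
--    so these queries may avoid a separate sort step.
select id, code from dbo.products order by code;
select id, row_number() over (order by code) as rn from dbo.products;
```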

So, what do you need the index for? What kind of queries would use it?

For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index against its benefits.

For probes you can, potentially, use a hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column = 'ABC'. You would have to carefully weigh the advantage of a narrow index (a 32-bit checksum) against the disadvantages: the collision double-check and the lack of range scan and ordering capabilities.
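
As a minimal sketch of that approach on an existing table: the names dbo.documents, big_key, and big_key_checksum below are hypothetical, and the script assumes the session SET options that indexed computed columns require (the defaults in most client tools).

```sql
-- Hypothetical table holding the wide key.
create table dbo.documents (
   id int not null identity(1,1) primary key,
   big_key varchar(1024) not null);
go

-- Persisted 32-bit hash of the wide column.
alter table dbo.documents
   add big_key_checksum as checksum(big_key) persisted;
go

-- Narrow index on the hash instead of on the wide column itself.
create index ix_documents_big_key_checksum
   on dbo.documents (big_key_checksum);
go

-- Rewritten probe: seek on the hash, then compare the full value
-- so that rows with a colliding checksum are filtered out.
select id
   from dbo.documents
   where big_key_checksum = checksum('ABC')
   and big_key = 'ABC';
```

The second predicate is not optional: different strings can share a checksum, so the hash alone only narrows the candidates.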

After comments

I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along these lines:

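-- dictionary table: each wide value is stored once and looked up via its hash
create table values_dictionary (
  id int not null identity(1,1),
  value varchar(8000) not null,
  -- persisted 32-bit hash of the value
  value_hash as checksum(value) persisted,
  constraint pk_values_dictionary_id
     primary key nonclustered (id));
-- cluster on (value_hash, id): colliding hashes sit together and the key is unique
create unique clustered index cdx_values_dictionary_checksum
  on values_dictionary (value_hash, id);
go

create procedure usp_get_or_create_value_id (
   @value varchar(8000),
   @id int output)
as
begin
   declare @hash int;
   set @hash = checksum(@value);
   set @id = NULL;
   -- seek on the hash, then compare the full value to rule out collisions
   select @id = id
      from values_dictionary
      where value_hash = @hash
      and value = @value;
   if @id is null
   begin
      insert into values_dictionary (value)
        values (@value);
      set @id = scope_identity();
   end
end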

In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniquifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is nonclustered, which means that looking up the value from an id incurs the overhead of one extra probe in the clustered index.
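
A short usage sketch, assuming the table and procedure above have been created as shown; the literal 'ABC' and the variable name are just for illustration.

```sql
-- Translate a value into its dictionary id (inserting it if missing).
declare @value_id int;
exec usp_get_or_create_value_id
   @value = 'ABC',
   @id = @value_id output;

-- Reverse lookup: seeks the nonclustered primary key on id first,
-- then probes the clustered index to fetch the wide value column.
select value
   from values_dictionary
   where id = @value_id;
```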

Not sure if this solves your problem; you obviously know more about your actual scenario than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be acceptable.
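
If duplicates are not acceptable, one common pattern (not part of the original answer) is to run the lookup and the insert in a single transaction and take key-range locks during the lookup. The sketch below assumes the values_dictionary table shown earlier; the procedure name is hypothetical, and error handling and nested-transaction handling are still left out.

```sql
create procedure usp_get_or_create_value_id_locked (
   @value varchar(8000),
   @id int output)
as
begin
   set nocount on;
   declare @hash int;
   set @hash = checksum(@value);
   set @id = null;

   begin transaction;
      -- updlock + holdlock: hold update/range locks on the probed hash
      -- range until commit, so a concurrent caller with the same hash
      -- waits here instead of also deciding that the value is missing
      select @id = id
         from values_dictionary with (updlock, holdlock)
         where value_hash = @hash
         and value = @value;
      if @id is null
      begin
         insert into values_dictionary (value)
           values (@value);
         set @id = scope_identity();
      end
   commit transaction;
end
```

The trade-off is lower concurrency for callers whose values hash to the same checksum, since they now serialize on that range.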
