SQL Server - 加快对大表的计数 [英] SQL Server - Speed up count on large table

查看:46
本文介绍了SQL Server - 加快对大表的计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含接近 3000 万 条记录的表.只是几列.'Born' 列之一具有不超过 30 个不同的值,并且在其上定义了一个索引.我需要能够过滤该列并有效地翻阅结果.

I have a table with close to 30 million records. Just several columns. One of the column 'Born' have not more than 30 different values and there is an index defined on it. I need to be able to filter on that column and efficiently page through results.

现在我有(例如,如果我要搜索的年份是 '1970' - 它是我的存储过程中的一个参数):

For now I have (example if the year I'm searching for is '1970' - it is a parameter in my stored procedure):

WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

这种类型的每个查询(仅使用 Born 参数)返回超过 100 万个结果.我注意到最大的开销是用于返回总结果的计数.如果我从 select 子句中删除 (SELECT count(*) FROM PersonSubset) AS TotalPeople 整个过程会加快很多.

Every query of that sort (only Born parameter used) returns just over 1 million results. I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause the whole thing speeds up a lot.

有没有办法加快该查询中的计数.我关心的是返回分页结果和总数.

Is there a way to speed up the count in that query. What I care about is to have the paged results returned and the total count.

推荐答案

更新了以下评论中的讨论

这里问题的原因是基数非常低>IX_Person_Born 索引.

The cause of the problem here is very low cardinality of the IX_Person_Born index.

SQL 索引非常擅长快速缩小值的范围,但是当您有大量具有相同值的记录时,它们就会出现问题.

SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.

你可以把它想象成电话簿的索引——如果你想找到Smith, John",你首先会发现有很多以 S 开头的名字,然后是一页又一页叫 Smith 的人,然后是很多约翰.你最终扫描了这本书.

You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.

这是复杂的,因为电话簿中的索引是聚集的 - 记录按姓氏排序.相反,如果您想找到每个叫约翰"的人,您将需要进行大量查找.

This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.

这里有 3000 万条记录,但只有 30 个不同的值,这意味着最好的索引仍然返回大约 100 万条记录——在这种规模下,它也可能是表扫描.这 100 万个结果中的每一个都不是实际记录 - 它是从索引到表(电话簿类比中的页码)的查找,这使得它更慢.

Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.

高基数指数(比如完整的出生日期),而不是年份会更快.

A high cardinality index (say for full date of birth), rather than year would be much quicker.

这是所有 OLTP 关系数据库的普遍问题:低基数 + 庞大的数据集 = 慢查询,因为索引树没有多大帮助.

This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.

简而言之:没有比使用 T-SQL 和索引更快获得计数的方法了.

In short: there's no significantly quicker way to get the count using T-SQL and indexes.

您有几个选择:

OLAP/Cube 汇总或自己动手:

Either OLAP/Cube rollups or do it yourself:

select Born, count(*) 
from Person 
group by Born

优点是多维数据集查找或检查缓存非常快.问题是数据会过时,您需要某种方式来解决这个问题.

The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.

拆分为两个查询:

SELECT count(*) 
FROM Person 
WHERE Born = '1970'

SELECT TOP 30 *
FROM Person 
WHERE Born = '1970'

然后在并行服务器端运行这些,或将其添加到用户界面.

Then run these either in parallel server side, or add it to the user interface.

这个问题是 no-SQL 解决方案相对于传统关系数据库的一大优势.在非 SQL 系统中,Person 表在许多廉价服务器之间联合(或分片).当用户搜索时,同时检查每台服务器.

This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.

在这一点上,技术变革可能已经结束,但可能值得研究,所以我将其包括在内.

At this point a technology change is probably out, but it may be worth investigating so I've included it.

过去我在使用这种大小的数据库时遇到过类似问题,并且(取决于上下文)我使用了选项 1 和选项 2.如果这里的总数用于分页,那么我可能会选择选项2 和 AJAX 调用以获取计数.

I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and AJAX call to get the count.

这篇关于SQL Server - 加快对大表的计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆