通过SQL查询高效获取3列8亿行表的统计信息 [英] get statistics information by SQL query efficiently for table with 3 columns and 800 million rows

查看:48
本文介绍了通过SQL查询高效获取3列8亿行表的统计信息的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 SQL Server 数据库中有一个包含 3 列和 8 亿行的数据表.

I have a data table with 3 columns and 800 million rows in SQL server database.

  locationID     cardID    value
  NY_xxxxx     xxxxxx    xxxx.xxxx   // x : integer digits from 0 to 9
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  ...

我需要计算同一位置的不同 cardID 数量.

I need to count the distinct numbers of cardID for the same location.

此外,我需要计算相同状态下的位置编号.例如,对于上表中的 NY,我们有 3 个位置.

Also, I need to count location numbers in the same state. For example, for NY in the above table, we have 3 locations.

我还需要知道每个州有多少个位置,每个位置有多少 cardID,每个州有多少 cardID.

I also need to know how many locations in each state, and how many cardID in each location and how many cardID in each state.

如何高效的通过SQL查询获取这些统计信息?数据表很大.

How to get these statistics information by SQL query efficiently? The data table size is large.

推荐答案

OP 可能已经知道这一点,但这里是如何获得答案,不考虑效率.首先,每个位置的卡片,如评论中所述:

The OP probably already knows this, but here is how to get the answer, disregarding efficiency. First, cards per location, as described in the comments:

SELECT locationid, COUNT(DISTINCT cardID)
FROM table 
GROUP BY locationid 

接下来每个州都做同样的事情.

Next the same thing per state.

SELECT substring(locationid, 1, 2) state, COUNT(DISTINCT cardID)
FROM table 
GROUP BY substring(locationid, 1, 2)

对于单个状态,

select COUNT(DISTINCT cardID)
from table 
where substring(locationid, 1, 2) = 'NY'

第一个查询的问题是它会返回类似 NY_1234 的 locationID.如果您还没有记住 NY_1234 通俗地说是什么,例如 Rochester,那么您必须到数据库之外看看它是什么.

The problem with the first query is that it will return locationID's like NY_1234. If you have not memorized what NY_1234 is in layman's terms, Rochester for example, you have to go outside your database to see what it is.

第二个查询效率低下,因为它必须对大量记录应用子字符串函数.第三个效率低下,因为它必须扫描整个表.

The second query will be inefficient because it has to apply the substring function on a large number of records. The third will be inefficient because it will have to scan the entire table.

如果你有一个关系模型,你可以做这样的事情.

If you had a relational model you could do stuff like this.

select municipality, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by municipality

select stateCode, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by stateCode

州代码将是 NY、NJ 等如果您想要纽约、新泽西等,那只是另一个字段.

State codes would be NY, NJ, etc If you wanted New York, New Jersey, etc, that would simply be another field.

最后,

select count(distinct cardID)
from table join location on table.locationid = location.locationID
where stateCode = 'NY'

后两个关系查询将比单表查询更快,因为它们不必使用函数.您可以使用 stateCode 上的索引进一步加快速度,但即使您不这样做,也会扫描一个小得多的表.

The second two relational queries would be faster than the single table queries because they wouldn't have to use functions. You could speed it up even more with an index on stateCode, but even if you didn't, a much smaller table would be scanned.

这篇关于通过SQL查询高效获取3列8亿行表的统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆