TABLESAMPLE 返回错误的行数? [英] TABLESAMPLE returns wrong number of rows?

查看:25
本文介绍了TABLESAMPLE 返回错误的行数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我刚刚发现了 TABLESAMPLE 子句,但令人惊讶的是它没有返回我指定的行数.

我使用的表有大约 14M 行,我想要一个 10000 行的任意样本.

select * from tabData TABLESAMPLE(10000 ROWS)

我每次执行时得到的不是 10000,而是不同的数字(在 8000 和 14000 之间).

这是怎么回事,我是否误解了 TABLESAMPLE 的预期用途?

编辑:

大卫的链接 解释得很好.

这总是以一种有效的方式返回 10000 个大致随机的行:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

REPEATABLE 选项有助于保持相同(除非数据已更改)

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

因为我想知道将 TABLESAMPLE 与大量行一起使用是否更昂贵以确保(?)我得到正确的行号,所以我已经对其进行了测量;

1.loop(20 次):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);(9938 行受影响)(10000 行受影响)(9383 行受影响)(9526 行受影响)(10000 行受影响)(9545 行受影响)(9560 行受影响)(9673 行受影响)(9608 行受影响)(9476 行受影响)(9766 行受影响)(10000 行受影响)(9500 行受影响)(9941 行受影响)(9769 行受影响)(9547 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(9478 行受影响)第一批(仅 10000 行)在:14 秒内完成!

2.loop(20次):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(受影响的 10000 行)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)(10000 行受影响)第二批(最大行数)在:13 秒内完成!

3.loop:使用 ORDER BY NEWID() 对 100% 随机行进行 counterscheck:

select TOP 10000 * from tabData ORDER BY NEWID();(10000 行受影响)

在执行一次持续 23 分钟

后取消

结论:

令人惊讶的是,使用精确的 TOP 子句和 TABLESAMPLE 中的大量数字的方法慢.因此,它是 ORDER BY NEWID() 的一个非常有效的替代方案,如果行不是每行随机而是每页级别都无关紧要(表的每个 8K 页被赋予一个随机值).

解决方案

参见 文章在这里.您需要添加一个 top 子句和/或使用可重复选项来获取您想要的行数.

I've just discovered the TABLESAMPLE clause but surprisingly it doesn't return the number of rows i've specified.

The table that i've used has ~14M rows and i wanted an arbitrary sample of 10000 rows.

select * from tabData TABLESAMPLE(10000 ROWS)

I get not 10000 but a different number everytime i execute it(between 8000 and 14000).

What's going on here, have i misunderstood the intended purpose of TABLESAMPLE?

Edit:

David's link explains it pretty well.

This returns always 10000 roughly random rows in an efficient way:

select TOP 10000 * from tabData TABLESAMPLE(20000 ROWS);

and the REPEATABLE option helps to get always the same (unless data has changed)

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS) REPEATABLE(100);

Since i wanted to know if it's more expensive to use TABLESAMPLE with a large number of rows to ensure(?) that i get the correct row-number, i've measured it;

1.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000 ROWS);

(9938 row(s) affected)
(10000 row(s) affected)
(9383 row(s) affected)
(9526 row(s) affected)
(10000 row(s) affected)
(9545 row(s) affected)
(9560 row(s) affected)
(9673 row(s) affected)
(9608 row(s) affected)
(9476 row(s) affected)
(9766 row(s) affected)
(10000 row(s) affected)
(9500 row(s) affected)
(9941 row(s) affected)
(9769 row(s) affected)
(9547 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(9478 row(s) affected)
First batch(only 10000 rows) completed in: 14 seconds!

2.loop (20 times):

select TOP 10000 * from tabData TABLESAMPLE(10000000 ROWS);

(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
(10000 row(s) affected)
Second batch(max rows) completed in: 13 seconds!

3.loop: counterscheck with 100% random rows using ORDER BY NEWID():

select TOP 10000 * from tabData ORDER BY NEWID();

(10000 row(s) affected)

Cancelled after one execution that lasted 23 minutes

Conclusion:

So suprisingly the approach with an exact TOP clause and a large number in TABLESAMPLE is not slower. Hence it's a very efficient alternative to ORDER BY NEWID() if it doesn't matter that the rows are not random per row but per page level(Each 8K page for the table is given a random value).

解决方案

See the article here. You need to add a top clause and/or use the repeatable option to get the number of rows you want.

这篇关于TABLESAMPLE 返回错误的行数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆