Is the LIMIT clause in HIVE really random?
Problem description
The HIVE documentation notes that the LIMIT clause returns rows chosen at random. I have been running a SELECT with LIMIT 1 on a table with more than 800,000 records, but it always returns the same record.
I'm using the Shark distribution, and I am wondering whether that has anything to do with this unexpected behavior. Any thoughts would be appreciated.
Thanks,
Visakh
Solution
Even though the documentation states that it returns rows at random, that is not actually true.
The rows are "chosen at random" only in the sense that, without any WHERE/ORDER BY clause, they come back in whatever order they happen to appear in the database. This means the result is not really random (or randomly chosen) as you would think; it is just that the order in which the rows are returned can't be determined.
As soon as you slap an order by x DESC limit 5 on there, it returns the 5 rows with the highest values of x, i.e. the last 5 rows of whatever you're selecting from.
To get rows returned at random, you would need to use something like: order by rand() LIMIT 1
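As a rough illustration of the difference, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a local stand-in for Hive, which is an assumption made purely so the example can run anywhere: SQLite spells the function RANDOM() where Hive uses rand(), and the table name records and its columns are hypothetical. It only shows the general pattern, not how Hive or Shark plan the query.

```python
import sqlite3


def build_table(n_rows=1000):
    """Create a hypothetical in-memory table `records` with n_rows rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO records (id, payload) VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


def plain_limit(conn):
    """LIMIT 1 with no ORDER BY: the engine returns whichever row it reads first."""
    return conn.execute("SELECT id FROM records LIMIT 1").fetchone()[0]


def random_limit(conn):
    """ORDER BY RANDOM() LIMIT 1: the SQLite analogue of Hive's order by rand() LIMIT 1."""
    return conn.execute(
        "SELECT id FROM records ORDER BY RANDOM() LIMIT 1"
    ).fetchone()[0]


conn = build_table()
plain_ids = {plain_limit(conn) for _ in range(50)}    # distinct ids seen without ORDER BY
random_ids = {random_limit(conn) for _ in range(50)}  # distinct ids seen with random ordering
print("distinct ids, plain LIMIT 1:", len(plain_ids))
print("distinct ids, ORDER BY RANDOM() LIMIT 1:", len(random_ids))
```

Running the plain query repeatedly keeps hitting the same row, much like the behavior described in the question, while the randomly ordered query spreads across the table at the cost of sorting every row.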
However, order by rand() can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the range of IDs in the table, generate a random number between them, and then select the matching records (in your case, just 1 record). This tends to be faster than having the database do the work, especially on a large dataset.
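The min/max approach described above could be sketched roughly as follows. This is again a Python example against an in-memory SQLite table rather than Hive, and it assumes a numeric id column; the table records, the column names, and the helper pick_random_row are all hypothetical names chosen for illustration.

```python
import random
import sqlite3


def pick_random_row(conn, rng=random):
    """Pick one row by drawing a random id between MIN(id) and MAX(id)."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM records").fetchone()
    if lo is None:  # empty table: MIN/MAX come back as NULL
        return None
    target = rng.randint(lo, hi)  # inclusive on both ends
    # Using >= with ORDER BY id tolerates gaps in the id sequence:
    # if `target` itself is missing, the next existing id is returned.
    return conn.execute(
        "SELECT id, payload FROM records WHERE id >= ? ORDER BY id LIMIT 1",
        (target,),
    ).fetchone()


# Hypothetical table whose ids have gaps (10, 17, 24, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10, 1000, 7)],
)
conn.commit()

print(pick_random_row(conn))
```

One trade-off worth noting: when the IDs have gaps, a row that follows a large gap is more likely to be picked, so the selection is only approximately uniform. If that matters, the draw can be repeated until the random number matches an existing ID exactly.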