Is the LIMIT clause in Hive really random?


Question

The Hive documentation notes that the LIMIT clause returns rows chosen at random. I have been running a SELECT with LIMIT 1 on a table with more than 800,000 records, but it always returns the same record.

I'm using the Shark distribution, and I am wondering whether that has anything to do with this unexpected behavior. Any thoughts would be appreciated.

Thanks, Visakh

Recommended answer

Even though the documentation states that it returns rows at random, that is not actually true.

It returns "rows chosen at random" only in the sense that, without any where/order by clause, it returns rows as they happen to appear in the database. This means the result is not really random (or randomly chosen) as you might think; it only means that the order in which the rows are returned can't be determined in advance.

As soon as you add order by x DESC limit 5, it returns the last 5 rows of whatever you're selecting from, that is, the 5 rows with the largest values of x.
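To make the effect of an explicit ordering concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for Hive/Shark, and the table `t`, the column `x`, and the helper `top_five` are hypothetical names chosen for illustration.

```python
import sqlite3

# SQLite is only a convenient stand-in for Hive/Shark here.
# The table `t` and column `x` are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(1, 101)])


def top_five(connection):
    """Return the 5 largest values of x, in descending order."""
    cur = connection.execute("SELECT x FROM t ORDER BY x DESC LIMIT 5")
    return [row[0] for row in cur.fetchall()]


print(top_five(conn))  # prints [100, 99, 98, 97, 96]
```

With the ORDER BY in place, the result no longer depends on how the rows happen to be stored, so every run returns the same 5 rows.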

To get rows returned at random, you would need to use something like: order by rand() LIMIT 1
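A rough sketch of that idea follows, again using Python's built-in sqlite3 module as a stand-in for Hive. Note the assumption: SQLite spells the function `RANDOM()`, whereas the Hive query above uses `rand()`. The table `t` and the helper `random_row` are hypothetical.

```python
import sqlite3

# SQLite stands in for Hive/Shark; table `t` is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x TEXT)")
conn.executemany(
    "INSERT INTO t (id, x) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)


def random_row(connection):
    """Pick one row by sorting on a random key and keeping the first row."""
    # SQLite's RANDOM() plays the role of Hive's rand() in this sketch.
    query = "SELECT id, x FROM t ORDER BY RANDOM() LIMIT 1"
    return connection.execute(query).fetchone()


# Repeated calls should now return different rows most of the time.
samples = {random_row(conn)[0] for _ in range(20)}
print(len(samples), "distinct ids in 20 draws")
```

The database has to assign a random key to every row and sort on it just to keep one row, which is why this gets expensive on large tables.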

However, it can have a speed impact if your indexes aren't set up properly. Usually I do a min/max to get the range of IDs in the table, generate a random number between them, and then select those records (in your case, just 1 record). This tends to be faster than having the database do the work, especially on a large dataset.
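The min/max approach could look roughly like the sketch below. It assumes a numeric `id` column and once more uses Python's built-in sqlite3 module as a stand-in for Hive. The table `t` and the helper `random_row_by_id` are hypothetical. Since real ID ranges often have gaps, the sketch selects the first row whose id is greater than or equal to the random number rather than requiring an exact match.

```python
import random
import sqlite3

# SQLite stands in for Hive/Shark; table `t` and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, x TEXT)")
# Deliberately gapped ids (10, 20, ..., 1000) to mimic deleted rows.
conn.executemany(
    "INSERT INTO t (id, x) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10, 1001, 10)],
)


def random_row_by_id(connection, rng=random):
    """Pick a row by drawing a random id between MIN(id) and MAX(id)."""
    lo, hi = connection.execute("SELECT MIN(id), MAX(id) FROM t").fetchone()
    if lo is None:
        return None  # empty table: MIN/MAX are NULL
    target = rng.randint(lo, hi)  # inclusive on both ends
    # Take the first existing id at or above the target, so gaps are tolerated.
    query = "SELECT id, x FROM t WHERE id >= ? ORDER BY id LIMIT 1"
    return connection.execute(query, (target,)).fetchone()


print(random_row_by_id(conn))
```

One caveat with this design: when the gaps between IDs are uneven, a row that follows a large gap is picked more often than others, so the selection is only approximately uniform.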
