HiveQL and rank()
Question
I can't understand HiveQL rank(). I've found a couple of implementations of rank UDF's on the WWW, such as Edward's nice example. I can load and access the functions, but I can't get them to do what I want. Here is a detailed example:
Loading the UDF into the CLI process:
$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java
$ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS 'com.m6d.hiveudf.Rank2';
Create the table:
create table purchases (
SalesRepId String,
PurchaseOrderId INT,
Amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Load data from this CSV:
Jana,1,100
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140
From the CLI:
LOAD DATA
LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
INTO TABLE purchases;
Now I can see my top Sales Reps:
select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC;
Nadia has sold $800 of stuff, Daniel and Jana have both sold $220, and William has sold $170
SalesRep Amount
-------- ------
Nadia 800
Daniel 220
Jana 220
William 170
Now all I want to do is number them: Nadia is #1, Daniel and Jana are tied for #2, and William is #4 (not #3)
select SalesRepId, V.volume,rank2(V.volume)
from
(select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC) V;
This is what I get, but NOT what I want:
SalesRep Amount Rank
-------- ------ ----
Nadia 800 1
Daniel 220 1
Jana 220 2
William 170 1
This is what I WANT, but I can't make hive do it for me:
SalesRep Amount Rank
-------- ------ ----
Nadia 800 1
Daniel 220 2
Jana 220 2
William 170 4
Can you help me with the correct HiveQL to rank my Sales Reps?
Thanks to JtheRocker for his response. His change resulted in this list:
SalesRep Amount Rank
-------- ------ ----
William 170 1
Daniel 220 2
Jana 220 2
Nadia 800 3
A slight modification to show Nadia as 4th (not 3rd):
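private long row_number; // rows seen so far; the long type is assumed, the original omitted it

@Override
public Object evaluate(DeferredObject[] currentKey) throws HiveException {
    row_number++;
    // A new key takes the current row number as its rank; tied rows keep the previous rank.
    if (!sameAsPreviousKey(currentKey)) {
        this.counter = row_number;
        copyToPreviousKey(currentKey);
    }
    return Long.valueOf(this.counter);
}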
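To check the counter logic of the modified evaluate() without a Hive installation, here is a minimal standalone sketch. It uses no Hive classes, and the class name CompetitionRankSketch, the methods next and rankAll, and the use of plain long keys are all assumptions made for illustration. It also assumes the rows reach the function already sorted, so that ties are adjacent.

```java
import java.util.Arrays;

// Standalone sketch of the counter logic in the modified evaluate() above.
// CompetitionRankSketch and its members are hypothetical names, not part of any Hive API.
final class CompetitionRankSketch {
    private long rowNumber = 0;      // rows seen so far (row_number in the UDF)
    private long counter = 0;        // rank given to the current run of equal keys
    private Long previousKey = null; // null until the first row has been seen

    // Mirrors evaluate(): always advance the row count, and adopt it as the
    // new rank only when the key differs from the previous row's key.
    long next(long key) {
        rowNumber++;
        if (previousKey == null || previousKey.longValue() != key) {
            counter = rowNumber;
            previousKey = key;
        }
        return counter;
    }

    // Ranks a sequence that is already sorted, so equal keys sit next to each other.
    static long[] rankAll(long[] sortedKeys) {
        CompetitionRankSketch state = new CompetitionRankSketch();
        long[] ranks = new long[sortedKeys.length];
        for (int i = 0; i < sortedKeys.length; i++) {
            ranks[i] = state.next(sortedKeys[i]);
        }
        return ranks;
    }

    public static void main(String[] args) {
        long[] volumesDesc = {800, 220, 220, 170}; // Nadia, Daniel, Jana, William
        System.out.println(Arrays.toString(rankAll(volumesDesc))); // prints [1, 2, 2, 4]
    }
}
```

Fed the volumes in descending order, the sketch yields 1, 2, 2, 4, which is the ranking asked for in the question. Fed in ascending order it yields the same rank sequence, with William first and Nadia fourth.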
Recommended answer
With the windowing and analytics functions introduced in Hive 0.11, you can use:
select SalesRepId, volume as amount , rank() over (order by V.volume desc) as rank from
(select SalesRepId,sum(amount) as volume from purchases group by SalesRepId) V;