HiveQL和rank() [英] HiveQL and rank()

查看:96
本文介绍了HiveQL和rank()的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我无法理解HiveQL的排名()。我在WWW上发现了一些排名UDF的实现,例如爱德华的好例子。我可以加载和访问函数,但我无法让他们做我想做的事。下面是一个详细的例子:



将UDF加载到CLI过程中:

  $ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com / m6d / hiveudf / Rank2。 java 
$ jar -cvf Rank2.jar com / m6d / hiveudf / Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS'com.m6d.hiveudf.Rank2';

创建表:

 创建表格购买(
SalesRepId字符串,
PurchaseOrderId INT,
金额INT

行格式限定
字段终止','
'\\\
'终止的线路;

从CSV中加载数据:

  Jana,1,100 
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140

通过CLI提供这个功能:

  LOAD DATA 
LOCAL INPATH'/home/hadoop/MyDemo/purchases.csv'
INTO TABLE购买;

现在我可以看到我的最高销售代表:

 选择SalesRepId,将金额(金额)作为批量
从采购
组中通过SalesRepId
ORDER BY批量DESC;



Nadia售出了800美元的东西,Daniel和Jana都卖出了220美元,而William卖出了170美元

  SalesRep金额
-------- ------
Nadia 800
Daniel 220
Jana 220
William 170

现在我想要要做的是数字他们:纳迪亚是第一,丹尼尔和贾纳并列第二,威廉是第四(不#3)

 <$从
中选择SalesRepId,V.volume,rank2(V.volume)
(选择SalesRepId,总额(金额)作为批量
从购买
组中通过SalesRepId
ORDER BY volume DESC)V;

这是我得到的,但不是我想要的:

  SalesRep金额等级
-------- ------ ----
Nadia 800 1
Daniel 220 1
Jana 220 2
William 170 1

这就是我想,但我不能让配置单元为我做:

  SalesRep金额等级
---- ---- ------ ----
Nadia 800 1
Daniel 220 2
Jana 220 2
William 170 4

你能帮我用正确的HiveQL来排列我的销售代表吗?



感谢JtheRocker的回应。他的变化导致了这个列表:

  SalesRep金额等级
-------- ---- - ----
William 170 1
Daniel 220 2
Jana 220 2
Nadia 800 3

稍微修改显示Nadia为4(不是第3个):

 私人row_number; 
@Override
public Object evaluate(DeferredObject [] currentKey)throws HiveException {
row_number ++;
if(!sameAsPreviousKey(currentKey)){
this.counter = row_number;
copyToPreviousKey(currentKey);
}
return new Long(this.counter);


解决方案

在Hive 0.11中引入的窗口化和分析功能,您可以使用:

$ b

pre $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $ $金额)作为来自采购组的销售量由SalesRepId)V;


I can't understand HiveQL rank(). I've found a couple of implementations of rank UDF's on the WWW, such as Edward's nice example. I can load and access the functions, but I can't get them to do what I want. Here is a detailed example:

Loading the UDF into the CLI process:

$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar:/home/hadoop/hive/lib/hive-exec-0.10.0.jar com/m6d/hiveudf/Rank2.java 
$ jar -cvf Rank2.jar com/m6d/hiveudf/Rank2.class
hive> ADD JAR /home/hadoop/MyDemo/Rank2.jar;
hive> CREATE TEMPORARY FUNCTION Rank2 AS 'com.m6d.hiveudf.Rank2'; 

Create a table:

create table purchases (
  SalesRepId String, 
  PurchaseOrderId INT, 
  Amount INT
) 
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

Load data from this CSV:

Jana,1,100
Nadia,2,200
Nadia,3,600
Daniel,4,80
Jana,5,120
William,6,170
Daniel,7,140

With this from the CLI:

LOAD DATA 
  LOCAL INPATH '/home/hadoop/MyDemo/purchases.csv'
  INTO TABLE purchases;

Now I can see my top Sales Reps:

select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC;

Nadia has sold $800 of stuff, Daniel and Jana have both sold $220, and William has sold $170

SalesRep    Amount
--------    ------
Nadia       800
Daniel      220
Jana        220
William     170

Now all I want to do is number them: Nadia is #1, Daniel and Jana are tied for #2, and William is #4 (not #3)

select SalesRepId, V.volume,rank2(V.volume)
from 
(select SalesRepId,sum(amount) as volume
from purchases
group by SalesRepId
ORDER BY volume DESC) V;

This is what I get, but NOT what I want:

SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      1
Jana        220      2
William     170      1

This is what I WANT, but I can't make hive do it for me:

SalesRep   Amount  Rank
--------   ------  ----
Nadia       800      1
Daniel      220      2
Jana        220      2
William     170      4

Can you help me with the correct HiveQL to rank my Sales Reps?

Thanks to JtheRocker for his response. His change resulted in this list:

SalesRep   Amount  Rank
--------   ------  ----
William     170     1
Daniel      220     2
Jana        220     2
Nadia       800     3

A slight modification to show Nadia as 4th (not 3rd):

private row_number;
@Override
public Object evaluate(DeferredObject[] currentKey) throws HiveException {
  row_number++;
  if (!sameAsPreviousKey(currentKey)) {
    this.counter = row_number;
    copyToPreviousKey(currentKey);
  }
return new Long(this.counter);
}

解决方案

With the Windowing and Analytics functions introduced in Hive 0.11, you can use:

select SalesRepId, volume as amount , rank() over (order by V.volume desc) as rank from 
(select SalesRepId,sum(amount) as volume from purchases group by SalesRepId) V;

这篇关于HiveQL和rank()的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆