How to write a Python UDF for a User Defined Aggregate Function in Hive


Problem




I would like to do some aggregation work on an aggregate column (after GROUP BY) in Hive using Python. I found there is UDAF for this purpose. All I can find is a Java example. Is there an example on writing in Python?

Or for python between UDF and UDAF, there is no difference? For UDAF, I just need to write it like a reducer? Please advise.

Solution

You can make use of Hive's streaming UDF functionality (TRANSFORM) to use a Python UDF which reads from stdin and outputs to stdout. You haven't found any Python "UDAF" examples because a UDAF refers to a Hive Java class you extend, so it can only be written in Java.

When using a streaming UDF, Hive will choose whether to launch a map or a reduce job, so there is no need to specify one (for more on this functionality see this link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).

Basically, your implementation would be to write a Python script which reads rows from stdin, calculates some aggregate number, and writes the result to stdout. To implement this in Hive, do the following:

1) First, add your Python script to your resource library in Hive so that it gets distributed across your cluster:

add file script.py;

2) Then call your transform function and input the columns you want to aggregate. Here is an example:

select transform(input cols)
using 'python script.py' as (output cols)
from table
;
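As a sketch of what such a script could look like: the `script.py` below is a hypothetical example (the column layout is an assumption, not from the original answer) that sums the numeric last column of the tab-separated rows Hive streams to stdin and prints the total.

```python
#!/usr/bin/env python
"""Hypothetical script.py for Hive TRANSFORM: sums the numeric last
column of tab-separated input rows and prints the grand total."""
import sys


def aggregate(lines):
    # Hive streams one row per line, columns separated by tabs;
    # NULL values arrive as the literal string "\N".
    total = 0.0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[-1] not in ("", "\\N"):
            total += float(fields[-1])
    return total


if __name__ == "__main__":
    print(aggregate(sys.stdin))
```

Note that with no GROUP BY-style key handling, this emits a single row for the whole input of one task, which is only correct when Hive runs the script in a single reducer.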

Depending on what you need to do, you may need a separate mapper and reducer script. If you need to aggregate based on column value, remember to use Hive's CLUSTER BY/DISTRIBUTE BY syntax in your mapper stage so that partitioned data gets sent to the reducer.
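To illustrate the reducer side, here is a hypothetical `reducer.py` (the two-column key/value layout is an assumption): because CLUSTER BY sends all rows for a key to the same reducer grouped together, the script can keep one running total and emit a row each time the key changes.

```python
#!/usr/bin/env python
"""Hypothetical reducer.py: assumes input is tab-separated (key, value)
rows, grouped by key as CLUSTER BY guarantees, and emits one sum per key."""
import sys


def reduce_rows(lines):
    results = []
    current_key, current_sum = None, 0.0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            # Key boundary reached: flush the previous key's total.
            if current_key is not None:
                results.append((current_key, current_sum))
            current_key, current_sum = key, 0.0
        current_sum += float(value)
    if current_key is not None:
        results.append((current_key, current_sum))
    return results


if __name__ == "__main__":
    for key, total in reduce_rows(sys.stdin):
        print("%s\t%s" % (key, total))
```

This grouping assumption is exactly why the CLUSTER BY/DISTRIBUTE BY step matters: without it, rows for one key could be split across reducers and each would emit a partial sum.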

Let me know if this helps.

