Using Pig and Python

Question

Apologies if this question is poorly worded: I am embarking on a large-scale machine learning project, and I don't like programming in Java; I love writing programs in Python. I have heard good things about Pig. I was wondering if someone could clarify how usable Pig is in combination with Python for math-related work. Also, if I am to write "streaming Python code", does Jython come into the picture? Is it more efficient if it does?

Thanks

P.S.: For several reasons I would prefer not to use Mahout's code as is. I might want to use a few of their data structures, though; it would be useful to know whether that is possible.

Recommended answer

Another option for using Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script where the data-processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. Joins, groupings, etc. work similarly in spirit to Pig, so there are no surprises if you already know Pig.

A word counting example looks like this:

from pycascading.helpers import *   # the usual PyCascading wildcard import (Flow, sources/sinks, UDF decorators)

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text; it emits one tuple per word
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    # Source: a text file read line by line
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    # Sink: tab-separated output
    output = flow.tsv_sink('output')

    # This is the processing pipeline: split into words, group by word, count
    input | split_words | GroupBy('word') | Count() | output

    flow.run()
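
For comparison, the route the question itself asks about keeps the pipeline in Pig Latin and pushes only the per-record logic into Python, executed under Jython through Pig's REGISTER ... USING jython mechanism. Below is a minimal sketch of such a UDF file; the file name wordcount_udfs.py, the function tokenize, and the Pig Latin shown in the comments are illustrative assumptions, not part of the answer above:

# wordcount_udfs.py -- an illustrative Python UDF file for Pig (run under Jython).
# A Pig script would pull it in and use it roughly like this:
#   REGISTER 'wordcount_udfs.py' USING jython AS udfs;
#   lines  = LOAD 'input.txt' AS (line:chararray);
#   words  = FOREACH lines GENERATE FLATTEN(udfs.tokenize(line)) AS word;
#   counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words);

# The outputSchema decorator is made available by Pig when the script is
# registered with USING jython; it declares the Pig schema of the return value.
@outputSchema('words:bag{t:tuple(word:chararray)}')
def tokenize(line):
    # Return a bag of one-field tuples, one per whitespace-separated word.
    return [(w,) for w in (line or '').split()]

The trade-off the answer points at: with Pig, only the record-level functions live in Python while the joins and groupings stay in Pig Latin, whereas with PyCascading the whole flow, including the wiring, is a single Python script.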
