How Can I Load Every File In a Folder Using PIG?


Question

I have a folder of files created daily that all store the same type of information. I'd like to make a script that loads the newest 10 of them, UNIONs them, and then runs some other code on them. Since Pig already has an ls command, I was wondering if there is a simple way to get the 10 most recently created files and load them all under generic names using the same loader and options. I'm guessing it would look something like:

REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
    file = LOAD 'file' 
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t') 
    AS (i1, i2, i3);

Recommended answer

Donald Miner's answer still works perfectly well, but IMO there's now a better approach: using Embedded Pig in Python. O'Reilly has a brief explanation here. There's also a presentation on why this is something you'd want to do, and how it works, here. Long story short, there's a lot of functionality it would be nice to have access to before running a Pig script, in order to determine parts of the script. Wrapping and/or dynamically generating parts of the script in Jython lets you do that. Rejoice!
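
To make that concrete, here is a minimal sketch of the "generate the script in Python" idea. It is not taken from either answer. The helper names (`newest_files`, `build_script`), the `recent` alias, and the example paths are all made up for illustration, and it assumes you already have a list of (path, modification time) pairs from somewhere, such as an HDFS directory listing. Only the loader and schema come from the question.

```python
# Loader and schema copied from the question; everything else is illustrative.
LOADER = r"com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')"
SCHEMA = "(i1, i2, i3)"


def newest_files(entries, n=10):
    """Return the paths of the n most recently modified files.

    entries is an iterable of (path, modification_time) pairs; how you obtain
    it (for example from an HDFS directory listing) is left to the caller.
    """
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [path for path, _mtime in ranked[:n]]


def build_script(paths, alias="recent", loader=LOADER, schema=SCHEMA):
    """Generate Pig Latin that loads every path and UNIONs the results."""
    if not paths:
        raise ValueError("need at least one path to load")
    template = "%s = LOAD '%s' USING %s AS %s;"
    if len(paths) == 1:
        # UNION needs at least two relations, so load straight into the alias.
        return template % (alias, paths[0], loader, schema)
    names = ["f%d" % i for i in range(len(paths))]
    lines = [template % (name, path, loader, schema)
             for name, path in zip(names, paths)]
    lines.append("%s = UNION %s;" % (alias, ", ".join(names)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical listing: 14 daily files, with the day number as the mtime.
    listing = [("/data/daily/part-%02d.lzo" % day, day) for day in range(1, 15)]
    script = build_script(newest_files(listing))
    print(script)
    # Under Jython with Pig on the classpath, the generated text could then be
    # handed to embedded Pig, roughly:
    #   from org.apache.pig.scripting import Pig
    #   Pig.compile(script).bind().runSingle()
```

The rest of your Pig code can then work on the `recent` relation. The two REGISTER lines from the question would need to be prepended to the generated text before compiling it.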
