How Can I Load Every File in a Folder Using Pig?


Question

I have a folder of files, created daily, that all store the same type of information. I'd like to write a script that loads the newest 10 of them, UNIONs them, and then runs some other code on the result. Since Pig already has an ls command, I was wondering whether there is a simple way to get the 10 most recently created files and load them all under generic names using the same loader and options. I'm guessing it would look something like:

REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
    file = LOAD 'file' 
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t') 
    AS (i1, i2, i3);

Recommended answer

Donald Miner's answer still works perfectly well, but IMO there's now a better approach: Embedded Pig in Python. O'Reilly has a brief explanation of it, and there's also a presentation on why you'd want to do this and how it works. Long story short, there's a lot of functionality it would be nice to have access to before a Pig script runs, so that you can decide what parts of the script should look like. Wrapping and/or dynamically generating parts of the script in Jython lets you do that. Rejoice!
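To make the idea concrete, here is a minimal sketch of generating the LOAD and UNION statements in Python before handing the script to Pig. It is an illustration under assumptions, not the code from the answer: the helper names (`newest`, `build_script`, `run_with_embedded_pig`) and the alias names (`f0`, `f1`, ..., `combined`) are made up, and the file listing is assumed to arrive as `(path, modification_time)` pairs obtained however is convenient on your cluster. The jar paths, loader, and schema are copied from the question. The code avoids newer Python syntax so that it should also run under the Jython bundled with Pig.

```python
# Sketch only: helper and alias names are hypothetical.
# Jar paths, loader, and schema are taken from the question.
REGISTERS = [
    "/usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar",
    "/usr/local/lib/hadoop/elephant-bird-2.0.5.jar",
]
LOADER = r"com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')"
SCHEMA = "(i1, i2, i3)"


def newest(files, n=10):
    """Return the paths of the n most recently modified files.

    files is a list of (path, modification_time) pairs; how the
    listing is obtained is left to the caller.
    """
    ranked = sorted(files, key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked[:n]]


def build_script(paths):
    """Build Pig Latin that loads every path and unions the results."""
    if not paths:
        raise ValueError("no input files")
    lines = ["REGISTER %s;" % jar for jar in REGISTERS]
    aliases = []
    for i, path in enumerate(paths):
        alias = "f%d" % i
        aliases.append(alias)
        lines.append("%s = LOAD '%s' USING %s AS %s;"
                     % (alias, path, LOADER, SCHEMA))
    if len(aliases) == 1:
        # UNION needs at least two inputs, so pass a single file through.
        lines.append("combined = FOREACH %s GENERATE *;" % aliases[0])
    else:
        lines.append("combined = UNION %s;" % ", ".join(aliases))
    return "\n".join(lines)


def run_with_embedded_pig(script):
    """Compile and run the script; only works when started via Pig's Jython."""
    from org.apache.pig.scripting import Pig  # unavailable in plain CPython
    return Pig.compile(script).bind().runSingle()
```

The generated text only defines `combined`; the "other code" from the question, ending in a STORE or DUMP, would need to be appended to the script before calling `run_with_embedded_pig`, since Pig does nothing without an output statement.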

