Writing one file per group in Pig Latin


Question

The Problem: I have numerous files that contain Apache web server log entries. The entries are not in date-time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date-time, and then write them to files named for the day and hour of the entries they contain.

Setup: Once I have imported my files, I use a regular expression to pull out the date field, then truncate it to the hour. This produces a set with the original record in one field and the date, truncated to the hour, in another. From there I group on the date-hour field.
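The setup described above might look roughly like this in Pig Latin (the paths, field names, and timestamp regex here are illustrative assumptions, not taken from the question):

```pig
-- Load the raw Apache log lines for one day (path is hypothetical)
raw = LOAD '/logs/2012-10-10/*' USING TextLoader() AS (line:chararray);

-- Pull the timestamp out of each line and truncate it to the hour,
-- e.g. '[10/Oct/2012:13:55:36 ...' -> '10/Oct/2012:13'
by_hour = FOREACH raw GENERATE
            line,
            REGEX_EXTRACT(line, '\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2})', 1) AS hour;

-- Group the records by the truncated date-hour field
grouped = GROUP by_hour BY hour;
```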

First Attempt: My first thought was to use the STORE command while iterating through my groups with a FOREACH, and I quickly found out that Pig does not allow STORE inside a FOREACH.

Second Attempt: My second try was to use the MultiStorage() store function from the piggybank, which worked great until I looked at the output files. The problem is that MultiStorage writes all fields to the file, including the field I grouped on. What I really want is just the original record written to the file.
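For reference, the second attempt presumably amounts to something like the following (the relation name and field index are assumptions based on the setup described earlier):

```pig
REGISTER piggybank.jar;

-- Split the output into one directory per distinct value of field 1
-- (the date-hour field). Each record is written with ALL of its
-- fields, including the grouping field -- which is the problem.
STORE by_hour INTO '/out' USING
  org.apache.pig.piggybank.storage.MultiStorage('/out', '1');
```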

The Question: So... am I using Pig for something it is not intended for, or is there a better way to approach this problem with Pig? Now that the question is out there, I will work on a simple code example to explain my problem further, and I will post it here once I have it. Thanks in advance.

Answer

Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.

Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value but still write to that group's file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rebuild the entire tuple. Or, if all you have is the original string, just pull that out and output it wrapped in a Tuple.
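A minimal sketch of the tuple-rebuilding step the answer describes, to be dropped into a modified copy of MultiStorage's putNext (the helper class and its name are hypothetical; only Tuple's public API is used, since Tuple has no remove):

```java
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical helper for a copy/subclass of MultiStorage: build a new
// tuple containing every field EXCEPT the one used for splitting, so
// the group value still routes the record but is never written out.
public class TupleStripper {
    public static Tuple stripField(Tuple in, int splitFieldIndex)
            throws ExecException {
        Tuple out = TupleFactory.getInstance().newTuple();
        for (int i = 0; i < in.size(); i++) {
            if (i != splitFieldIndex) {
                out.append(in.get(i));  // copy all other fields in order
            }
        }
        return out;
    }
}
```

Inside the modified putNext, the split field would still be read from the original tuple to pick the output file, but the tuple handed to the underlying record writer would be the stripped copy rather than the original.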

Some general documentation on writing load/store functions, in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions

