Restricting loading of log files in Pig Latin based on a date range of interest as parameter input


Problem description

I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance. The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (eg. /year=2013/month=02/day=14), and with the date stamp in the filename. For any day, multiple logs could exist, each hundreds of MB.

I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (eg. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can wrap months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, eg:

>>> log_path_regex(2013, 1, 28)
'{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
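The question doesn't include the UDF body, but given the input/output above, a minimal Python sketch that reproduces it (the function name matches the question; the implementation details are assumptions) might look like:

```python
from datetime import date, timedelta

def log_path_regex(year, month, day):
    """Build a Hadoop path glob covering the 7 days starting at the given date."""
    start = date(year, month, day)
    parts = (
        "year={:04d}/month={:02d}/day={:02d}".format(d.year, d.month, d.day)
        for d in (start + timedelta(days=i) for i in range(7))
    )
    return "{" + ",".join(parts) + "}"

print(log_path_regex(2013, 1, 28))
# {year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}
```

Using `datetime` arithmetic rather than manual month math is what keeps this short: week boundaries that wrap a month or year come out correctly for free.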

This glob will then be inserted in the appropriate path:

> %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); // See http://github.com/msukmanowsky/OmnitureTextLoader

Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but grunt complains: it claims to be logging error messages, yet the log file is never written:

> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log

% ls /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory

The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

If I try to use define (which I believe only works for static Java functions), I get the following:

> define week_path util.log_path_regex(year, month, day);
2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37>  mismatched input 'year' expecting RIGHT_PAREN

As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory. Similarly, passing the generated glob as a single parameter may complicate an automated job on an instantiated MapReduce cluster.

It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF which seems to be the root of the issue.

This really boils down to me wanting to include a dynamic path glob built inside Pig in my LOAD statement. Pig doesn't seem to be making that easy.

Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)

Is a custom LoadFunc the answer? With that, I'd presumably have to specify /year=/month=/day=* and force Pig to test every file name for a date stamp which falls between two dates. That seems like a huge hack and a waste of resources.

Any ideas?

Answer

I posted this question to the Pig user list. My understanding is that Pig first pre-processes its scripts to substitute parameters, imports and macros before building the DAG. This makes it effectively impossible to build new variables from existing ones within the script itself, and explains why my UDF could not be used to construct a path glob.

If you're a Pig developer who needs new variables built from existing parameters, you can either use another script to construct those variables and pass them to your Pig script as parameters, or examine where the new variables are actually needed and build them in a separate construct there.
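As a concrete illustration of the wrapper-script route, a small Python driver could compute the week's glob outside Pig and hand it over as a single parameter with `-p`, sidestepping the preprocessing limitation. This is a sketch; the script name and the `week_path` parameter name are assumptions:

```python
import subprocess
from datetime import date, timedelta

def pig_command(script, year, month, day):
    """Build a pig invocation that passes the week's path glob as one parameter."""
    start = date(year, month, day)
    glob = "{" + ",".join(
        "year={:04d}/month={:02d}/day={:02d}".format(d.year, d.month, d.day)
        for d in (start + timedelta(days=i) for i in range(7))
    ) + "}"
    # pig -p name=value performs parameter substitution before the script runs
    return ["pig", "-p", "week_path=" + glob, script]

# To actually run it:
# subprocess.check_call(pig_command("script.pig", 2013, 1, 28))
```

As the question notes, though, this complicates deployment on an instantiated MapReduce cluster, which is why the answer below stays inside Pig.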

In my case, I reluctantly opted to create a custom LoadFunc as described by Cheolsoo Park. This LoadFunc accepts the day, month and year for the beginning of the report period in its constructor, and builds a pathGlob attribute to match paths for that period. That pathGlob is then substituted into the location in setLocation(), e.g.:

/**
 * Limit data to a week starting at given day. If day is 0, month is assumed.
 */
public WeeklyOrMonthlyTextLoader(String year, String month, String day) {
    super();
    pathGlob = getPathGlob(
        Integer.parseInt(year),
        Integer.parseInt(month),
        Integer.parseInt(day)
    );
}

/**
 * Replace DATE_PATH in location with glob required for reading in this
 * month or week of data. This assumes the following directory structure:
 *
 * <code>/year=&lt;year&gt;/month=&lt;month&gt;/day=&lt;day&gt;/*</code>
 */
@Override
public void setLocation(String location, Job job) throws IOException {
    location = location.replace(GLOB_PLACEHOLDER, pathGlob);
    super.setLocation(location, job);
}

This is then called from a Pig script like so:

DEFINE TextLoader com.foo.WeeklyOrMonthlyTextLoader('$year', '$month', '$day');

Note that the constructor accepts String, not int. This is because parameters in Pig are strings and cannot be cast or converted to other types within the Pig script (except when used in MR tasks).

While creating a custom LoadFunc may seem overkill compared to a wrapper script, I wanted the solution to be pure Pig to avoid forcing analysts to perform a setup task before working with their scripts. I also wanted to readily use a stock Pig script for different periods when creating an Amazon MapReduce cluster for a scheduled job.

