Restricting loading of log files in Pig Latin based on interested date range as parameter input


Question


I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance. The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (eg. /year=2013/month=02/day=14), and with the date stamp in the filename. For any day, multiple logs could exist, each hundreds of MB.

I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (eg. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can wrap months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, eg:

>>> log_path_regex(2013, 1, 28)
'{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
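The 8-line UDF itself isn't shown in the post; a minimal Python sketch that reproduces the glob above (assuming a fixed seven-day window starting at the given date) might look like:

```python
from datetime import date, timedelta

def log_path_regex(year, month, day):
    """Build a brace glob covering the seven days starting at the given date."""
    start = date(year, month, day)
    days = [start + timedelta(days=i) for i in range(7)]
    return "{" + ",".join(
        "year=%d/month=%02d/day=%02d" % (d.year, d.month, d.day) for d in days
    ) + "}"
```

Because `datetime.date` handles the month and year rollover, the glob correctly wraps from January into February.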

This glob will then be inserted in the appropriate path:

> %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); // See http://github.com/msukmanowsky/OmnitureTextLoader

Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but grunt complains: it claims to be logging error messages, yet the log file is never created:

> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
% ls /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory

The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

If I try to use define (which I believe only works for static Java functions), I get the following:

> define week_path util.log_path_regex(year, month, day);
2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37>  mismatched input 'year' expecting RIGHT_PAREN

As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.

I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory. Similarly, passing the generated glob as a single parameter may complicate an automated job on an instantiated MapReduce cluster.

It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF which seems to be the root of the issue.

This really boils down to me wanting to include a dynamic path glob built inside Pig in my LOAD statement. Pig doesn't seem to be making that easy.

Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)

Is a custom LoadFunc the answer? With that, I'd presumably have to specify /year=/month=/day=* and force Pig to test every file name for a date stamp which falls between two dates. That seems like a huge hack and a waste of resources.

Any ideas?

Solution

I posted this question to the Pig user list. My understanding is that Pig will first pre-process its scripts to substitute parameters, imports and macros before building the DAG. This makes building new variables based on existing ones somewhat impossible, and explains my failure to build a UDF to construct a path glob.

If you are a Pig developer requiring new variables to be built based on existing parameters, you can either use another script to construct those variables and pass them as parameters to your Pig script, or you can explore where you need to use those new variables and build them in a separate construct based on your needs.
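As a sketch of the wrapper-script approach (the script name `weekly_report.pig` and the parameter name are hypothetical), the glob could be built outside Pig and handed over with `-p`; the construction mirrors the Python UDF from the question:

```python
from datetime import date, timedelta

def week_glob(year, month, day):
    # Same brace-glob construction as the Python UDF in the question.
    start = date(year, month, day)
    return "{" + ",".join(
        "year=%d/month=%02d/day=%02d" % (d.year, d.month, d.day)
        for d in (start + timedelta(days=i) for i in range(7))
    ) + "}"

# Build the pig invocation; only the command is constructed here --
# subprocess.call(cmd) would actually run it.
cmd = ["pig", "-p", "week_path=" + week_glob(2013, 1, 28), "weekly_report.pig"]
```

The downside, as noted above, is that every caller must now go through the wrapper rather than invoking the Pig script directly.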

In my case, I reluctantly opted to create a custom LoadFunc as described by Cheolsoo Park. This LoadFunc accepts the day, month, and year for the beginning of the reporting period in its constructor, and builds a pathGlob attribute matching the paths for that period. That pathGlob is then substituted into the location in setLocation(). eg.

/**
 * Limit data to a week starting at the given day. If day is 0, the whole
 * month is assumed. (pathGlob, GLOB_PLACEHOLDER and getPathGlob() are
 * class members defined elsewhere.)
 */
public WeeklyOrMonthlyTextLoader(String year, String month, String day) {
    super();
    pathGlob = getPathGlob(
        Integer.parseInt(year),
        Integer.parseInt(month),
        Integer.parseInt(day)
    );
}

/**
 * Replace DATE_PATH in location with glob required for reading in this
 * month or week of data. This assumes the following directory structure:
 *
 * <code>/year=&lt;year&gt;/month=&lt;month&gt;/day=&lt;day&gt;/*</code>
 */
@Override
public void setLocation(String location, Job job) throws IOException {
    location = location.replace(GLOB_PLACEHOLDER, pathGlob);
    super.setLocation(location, job);
}

This is then called from a Pig script like so:

DEFINE TextLoader com.foo.WeeklyOrMonthlyTextLoader('$year', '$month', '$day');

Note that the constructor accepts String, not int. This is because parameters in Pig are strings and cannot be cast or converted to other types within the Pig script (except when used in MR tasks).

While creating a custom LoadFunc may seem overkill compared to a wrapper script, I wanted the solution to be pure Pig to avoid forcing analysts to perform a setup task before working with their scripts. I also wanted to readily use a stock Pig script for different periods when creating an Amazon MapReduce cluster for a scheduled job.
