Pig Latin: Load multiple files from a date range (part of the directory structure)


I have the following scenario-

Pig version used 0.70

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load files from a date range say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

But it fails when I try the same with LOAD inside the Pig script. How do I use the parameters passed to the Pig script to load data from a date range?

Error log follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more



Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?

Cheers

Solution

Pig is processing your file name pattern with the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' range operator. That leaves you two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application I went with the generated-list route, and it works fine (CDH3 distribution).
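A minimal sketch of that wrapper idea, assuming Python (which the question already mentions): expand the inclusive date range into the comma-separated {d1,d2,...,dN} glob that Hadoop does support, then feed it to the script. The function name and date format are illustrative, not from the original answer.

```python
import datetime

def date_glob(start, end, fmt="%Y%m%d"):
    """Return a Hadoop-style glob like '{20100810,20100811,...}'
    covering the inclusive date range start..end."""
    d0 = datetime.datetime.strptime(start, fmt).date()
    d1 = datetime.datetime.strptime(end, fmt).date()
    # One stamp per day, endpoints included.
    days = [(d0 + datetime.timedelta(n)).strftime(fmt)
            for n in range((d1 - d0).days + 1)]
    return "{" + ",".join(days) + "}"

print(date_glob("20100810", "20100813"))
# -> {20100810,20100811,20100812,20100813}
```

The generated string could then be handed to Pig with parameter substitution, e.g. `pig -param dates='{20100810,...}' script.pig` and `temp = LOAD '/user/training/test/$dates' USING SomeLoader() AS (...);` in the script (the parameter name `dates` is just an example).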
