Pig Latin: Load multiple files from a date range (part of the directory structure)

Problem Description
I have the following scenario:

Pig version used: 0.70
Sample HDFS directory structure:
/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>
As you can see in the paths listed above, one of the directory names is a date stamp.
Problem: I want to load files from a date range, say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:
temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);
The following works with hadoop:
hadoop fs -ls /user/training/test/{20100810..20100813}
But it fails when I try the same with LOAD inside the Pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?
Error log follows:
Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
Do I need to use a higher-level language like Python to capture all date stamps in the range and pass them to LOAD as a comma-separated list?

Cheers
Pig processes your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN}
list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
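As a rough sketch of the wrapper-script route in Python: the function below expands an inclusive 'from'/'to' range into the comma-separated list that Pig's brace glob accepts. The script name and the `input_dirs` parameter name used in the usage note are illustrative assumptions, not anything from the original post.

```python
# gen_dates.py (hypothetical name): expand a yyyymmdd date range into the
# comma-separated list Pig's {a,b,c} glob syntax understands.
from datetime import datetime, timedelta


def date_list(start, end, fmt="%Y%m%d"):
    """Return the date stamps from start to end, inclusive, joined by commas."""
    d0 = datetime.strptime(start, fmt).date()
    d1 = datetime.strptime(end, fmt).date()
    days = (d1 - d0).days
    return ",".join((d0 + timedelta(n)).strftime(fmt) for n in range(days + 1))


if __name__ == "__main__":
    # For the range in the question this prints:
    # 20100810,20100811,20100812,20100813
    print(date_list("20100810", "20100813"))
```

You could then wrap the output in braces and hand it to Pig via parameter substitution, e.g. `pig -param input_dirs="{$(python gen_dates.py)}" myscript.pig`, and in the script write `LOAD '/user/training/test/$input_dirs' USING SomeLoader() AS (...);`.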