Loading data from HDFS does not work with Elephantbird
Problem description
I am trying to process data with ElephantBird in Pig, but I cannot load the data. Here is my Pig script:
register 'lib/elephant-bird-core-3.0.9.jar';
register 'lib/elephant-bird-pig-3.0.9.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';
twitter = LOAD 'statuses.log.2013-04-01-00'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DUMP twitter;
The output I get is
[main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21
[main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop1/twitter_test/pig_1374834826168.log
[main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop1/.pigbootup not found
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master.hadoop:8020
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master.hadoop:8021
[main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
[main] WARN org.apache.pig.backend.hadoop23.PigJobControl - falling back to default JobControl (not using hadoop 0.23 ?)
java.lang.NoSuchFieldException: jobsInProgress
at java.lang.Class.getDeclaredField(Class.java:1938)
at org.apache.pig.backend.hadoop23.PigJobControl.<clinit>(PigJobControl.java:58)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:102)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:285)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
at org.apache.pig.PigServer.storeEx(PigServer.java:933)
at org.apache.pig.PigServer.store(PigServer.java:900)
at org.apache.pig.PigServer.openIterator(PigServer.java:813)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
[main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=656085089
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6015425922938886053.jar
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6015425922938886053.jar created
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
[JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
[JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 5
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201307261031_0050
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases twitter
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: twitter[10,10] C: R:
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://master.hadoop:50030/jobdetails.jsp?jobid=job_201307261031_0050
[main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307261031_0050 has failed! Stop running all dependent jobs
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.3.0 0.11.0-cdh4.3.0 hadoop1 2013-07-26 12:33:48 2013-07-26 12:34:23 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201307261031_0050 twitter MAP_ONLY Message: Job failed! hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504,
Input(s):
Failed to read data from "hdfs://master.hadoop:8020/user/hadoop1/statuses.log.2013-04-01-00"
Output(s):
Failed to produce result in "hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201307261031_0050
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
Details at logfile: /home/hadoop1/twitter_test/pig_1374834826168.log
The file exists and is accessible:
$ hdfs dfs -ls /user/hadoop1/statuses.log.2013-04-01-00
Found 1 items
-rw-r--r-- 3 hadoop1 supergroup 656085089 2013-07-26 11:53 /user/hadoop1/statuses.log.2013-04-01-00
This seems to be a general problem with the Pig version shipped with Cloudera 4.6.0: the problem seems to be the line that says
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
I got a similar error when running another user-defined function for loading data:
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
When I force Pig into local mode ("-x local") I get the more obvious error
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
So the version of Hadoop that Pig uses seems to be incompatible with the one shipped with Cloudera, I guess.
This is indeed a versioning problem: some libraries are not yet compatible with the new MapReduce API; see for example hadoop-lzo issue #56 (https://github.com/twitter/hadoop-lzo/issues/56) and issues #247 and #308. For ElephantBird the issue is solved in a recent version. Using ElephantBird 4.1 in the above code and adding the Hadoop compatibility module
register 'lib/elephant-bird-core-4.1.jar';
register 'lib/elephant-bird-pig-4.1.jar';
register 'lib/elephant-bird-hadoop-compat-4.1.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';
solved the problem! :-)