Loading data from HDFS does not work with Elephantbird


Problem description

I am trying to process data with ElephantBird in Pig, but loading the data fails. Here is my Pig script:

register 'lib/elephant-bird-core-3.0.9.jar';
register 'lib/elephant-bird-pig-3.0.9.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';

twitter = LOAD 'statuses.log.2013-04-01-00' 
          USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

DUMP twitter;

The output I get is

[main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21
[main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop1/twitter_test/pig_1374834826168.log
[main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop1/.pigbootup not found
[main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master.hadoop:8020
[main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master.hadoop:8021
[main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
[main] WARN  org.apache.pig.backend.hadoop23.PigJobControl - falling back to default JobControl (not using hadoop 0.23 ?)
java.lang.NoSuchFieldException: jobsInProgress
    at java.lang.Class.getDeclaredField(Class.java:1938)
    at org.apache.pig.backend.hadoop23.PigJobControl.<clinit>(PigJobControl.java:58)
    at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:102)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:285)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
    at org.apache.pig.PigServer.storeEx(PigServer.java:933)
    at org.apache.pig.PigServer.store(PigServer.java:900)
    at org.apache.pig.PigServer.openIterator(PigServer.java:813)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:604)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
[main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=656085089
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6015425922938886053.jar
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6015425922938886053.jar created
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
[JobControl] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
[JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 5
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201307261031_0050
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases twitter
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: twitter[10,10] C:  R: 
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://master.hadoop:50030/jobdetails.jsp?jobid=job_201307261031_0050
[main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307261031_0050 has failed! Stop running all dependent jobs
[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion   PigVersion  UserId  StartedAt   FinishedAt  Features
2.0.0-cdh4.3.0  0.11.0-cdh4.3.0 hadoop1 2013-07-26 12:33:48 2013-07-26 12:34:23 UNKNOWN

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201307261031_0050   twitter MAP_ONLY    Message: Job failed!    hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504,

Input(s):
Failed to read data from "hdfs://master.hadoop:8020/user/hadoop1/statuses.log.2013-04-01-00"

Output(s):
Failed to produce result in "hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201307261031_0050


[main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
Details at logfile: /home/hadoop1/twitter_test/pig_1374834826168.log

The file exists and is accessible:

$ hdfs dfs -ls /user/hadoop1/statuses.log.2013-04-01-00
Found 1 items
-rw-r--r--   3 hadoop1 supergroup  656085089 2013-07-26 11:53 /user/hadoop1/statuses.log.2013-04-01-00

This seems to be a general problem with the Pig version shipped with Cloudera 4.6.0 (the log above reports Pig 0.11.0-cdh4.3.0). The telling line is the one that says

[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

I got a similar error when running another user-defined function for loading data:

[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

When I force Pig into local mode (-x local), I get the more obvious error

Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

So my guess is that the Hadoop version these loader libraries were built against is incompatible with the one shipped with Cloudera.
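One way to check this guess is to ask the JVM whether the two types named in the errors are classes or interfaces on the classpath actually in use. In the Hadoop 1 line of the MapReduce API, Counter and TaskAttemptContext are classes; in the Hadoop 2 line they are interfaces, and a library compiled against the class form fails with IncompatibleClassChangeError when it meets the interface form. The following is only a minimal sketch: the class name MapReduceApiCheck and the helper describe are hypothetical and not part of Pig, Hadoop, or ElephantBird.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical diagnostic helper (not part of any library): reports whether
 * selected Hadoop MapReduce API types are classes or interfaces on the
 * classpath this program is started with.
 */
public class MapReduceApiCheck {

    /** Returns "interface", "class", or "missing" for the given type name. */
    static String describe(String className) {
        try {
            // initialize=false: inspect the type without running its static initializers
            Class<?> type = Class.forName(className, false,
                    MapReduceApiCheck.class.getClassLoader());
            return type.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException | LinkageError e) {
            // The type, or something it depends on, is not on the classpath
            return "missing";
        }
    }

    public static void main(String[] args) {
        // The two types named in the error messages above
        List<String> names = Arrays.asList(
                "org.apache.hadoop.mapreduce.Counter",
                "org.apache.hadoop.mapreduce.TaskAttemptContext");
        for (String name : names) {
            System.out.println(name + " -> " + describe(name));
        }
    }
}
```

Compiled and run with the cluster's Hadoop jars on the classpath (for example the output of `hadoop classpath`), this should show which flavor of the API the installation provides. Without Hadoop on the classpath, both types are reported as missing.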

Solution

This is indeed a versioning problem: some libraries are not yet compatible with the new MapReduce API; see for example issue #56 (https://github.com/twitter/hadoop-lzo/issues/56) as well as #247 and #308. For ElephantBird the issue is solved in a recent version. Using ElephantBird 4.1 in the above code and adding the Hadoop compatibility module

register 'lib/elephant-bird-core-4.1.jar';
register 'lib/elephant-bird-pig-4.1.jar';
register 'lib/elephant-bird-hadoop-compat-4.1.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';

solved the problem! :-)
