Loading data from HDFS does not work with Elephantbird
Problem description
I am trying to process data with ElephantBird in Pig, but I cannot load the data. Here is my Pig script:
register 'lib/elephant-bird-core-3.0.9.jar';
register 'lib/elephant-bird-pig-3.0.9.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';
twitter = LOAD 'statuses.log.2013-04-01-00'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
DUMP twitter;
The output I get is
[main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21
[main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop1/twitter_test/pig_1374834826168.log
[main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop1/.pigbootup not found
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master.hadoop:8020
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: master.hadoop:8021
[main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
[main] WARN org.apache.pig.backend.hadoop23.PigJobControl - falling back to default JobControl (not using hadoop 0.23 ?)
java.lang.NoSuchFieldException: jobsInProgress
at java.lang.Class.getDeclaredField(Class.java:1938)
at org.apache.pig.backend.hadoop23.PigJobControl.<clinit>(PigJobControl.java:58)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:102)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:285)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
at org.apache.pig.PigServer.storeEx(PigServer.java:933)
at org.apache.pig.PigServer.store(PigServer.java:900)
at org.apache.pig.PigServer.openIterator(PigServer.java:813)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
[main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=656085089
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6015425922938886053.jar
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6015425922938886053.jar created
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
[JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
[JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 5
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201307261031_0050
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases twitter
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: twitter[10,10] C: R:
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://master.hadoop:50030/jobdetails.jsp?jobid=job_201307261031_0050
[main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307261031_0050 has failed! Stop running all dependent jobs
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.0.0-cdh4.3.0 0.11.0-cdh4.3.0 hadoop1 2013-07-26 12:33:48 2013-07-26 12:34:23 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201307261031_0050 twitter MAP_ONLY Message: Job failed! hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504,
Input(s):
Failed to read data from "hdfs://master.hadoop:8020/user/hadoop1/statuses.log.2013-04-01-00"
Output(s):
Failed to produce result in "hdfs://master.hadoop:8020/tmp/temp971280905/tmp1376631504"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201307261031_0050
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
Details at logfile: /home/hadoop1/twitter_test/pig_1374834826168.log
The file exists and is accessible:
$ hdfs dfs -ls /user/hadoop1/statuses.log.2013-04-01-00
Found 1 items
-rw-r--r-- 3 hadoop1 supergroup 656085089 2013-07-26 11:53 /user/hadoop1/statuses.log.2013-04-01-00
This seems to be a general problem with the Pig version shipped with Cloudera 4.6.0: the problem seems to be the line that says
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
I got a similar error when running another user-defined function for loading data:
[main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
When I force Pig into local mode ("-x local") I get the more obvious error
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
So the version of Hadoop that Pig uses seems to be incompatible with the one shipped with Cloudera, I guess.
This is indeed a versioning problem: some libraries are not yet compatible with the new MapReduce API; see for example hadoop-lzo issue #56 (https://github.com/twitter/hadoop-lzo/issues/56) and issues #247 and #308. For ElephantBird the issue is solved in a recent version. Using ElephantBird 4.1 in the above code and adding the Hadoop compatibility module
register 'lib/elephant-bird-core-4.1.jar';
register 'lib/elephant-bird-pig-4.1.jar';
register 'lib/elephant-bird-hadoop-compat-4.1.jar';
register 'lib/google-collections-1.0.jar';
register 'lib/json-simple-1.1.jar';
solved the problem! :-)