How to load files on a Hadoop cluster using Apache Pig?


Problem description


I have a Pig script and need to load files from the local Hadoop cluster. I can list the files using a hadoop command: hadoop fs -ls /repo/mydata, but when I try to load the files in the Pig script, it fails. The load statement is:

in = LOAD '/repo/mydata/2012/02' USING PigStorage() AS (event:chararray, user:chararray)

The error message is:

Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: file:/repo/mydata/2012/02

Any ideas? Thanks.

Solution

My suggestion:

  1. Create a folder in HDFS: hadoop fs -mkdir /pigdata

  2. Load the file into the created HDFS folder: hadoop fs -put /opt/pig/tutorial/data/excite-small.log /pigdata

(or you can do it from the grunt shell as grunt> copyFromLocal /opt/pig/tutorial/data/excite-small.log /pigdata)

  3. Execute the Pig Latin script:

       grunt> set debug on
    
       grunt> set job.name 'first-p2-job'
    
       grunt> log = LOAD 'hdfs://hostname:54310/pigdata/excite-small.log' AS 
                  (user:chararray, time:long, query:chararray); 
       grunt> grpd = GROUP log BY user; 
       grunt> cntd = FOREACH grpd GENERATE group, COUNT(log); 
       grunt> STORE cntd INTO 'output';
    

  4. The output file will be stored in hdfs://hostname:54310/pigdata/output
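
For the original question, the key hint is the file: scheme in the error, which suggests Pig resolved the path against the local filesystem rather than HDFS. Pointing the LOAD at an explicit hdfs:// URI (or starting Pig in mapreduce mode so bare paths resolve in HDFS) should avoid that. A minimal sketch, assuming the namenode is at hostname:54310 as above and the files under /repo/mydata/2012/02 are tab-delimited (the alias events is arbitrary):

       $ pig -x mapreduce            # mapreduce mode: unqualified paths resolve against HDFS
       grunt> events = LOAD 'hdfs://hostname:54310/repo/mydata/2012/02'
                  USING PigStorage() AS (event:chararray, user:chararray);
       grunt> DUMP events;           -- quick check that records are actually read from HDFS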
