Loading a file delimited by a double colon (::) in Pig


Problem description

The following is a sample dataset delimited by a double colon (::).

1::Toy Story (1995)::Animation|Children's|Comedy

I want to extract three fields from this dataset: movie ID, title, and genre. I have written the following code for that:

movies = LOAD 'location/of/dataset/on/hdfs'
         USING PigStorage('::')
         AS (MovieID:int, title:chararray, genre:chararray);

But I am getting the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to  parse:  
 <file script.pig, line 1, column 9> pig script failed to validate:
 java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[::]' 

Solution

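Use MyRegExLoader; you will need piggybank.jar for this. PigStorage accepts only a single-character field delimiter, which is why it cannot be instantiated with '::'.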

REGISTER '/path/to/piggybank.jar';
A = LOAD '/path/to/dataset'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('([^\\:]+)::([^\\:]+)::([^\\:]+)')
    AS (movieid:int, title:chararray, genre:chararray);

Output:

(1,Toy Story (1995),Animation|Children's|Comedy)
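
To sanity-check the regular expression outside of Pig, a small sketch like the one below may help. It uses Python's re module as a stand-in for the Java regex engine that Pig runs on, and it assumes the whole line must match the pattern; how MyRegExLoader applies the pattern internally may differ. The helper name split_record and the second test line are hypothetical and only for illustration.

```python
import re

# Same pattern as the one passed to MyRegExLoader, written as it looks
# after Pig turns '\\:' into '\:' (an escaped, literal colon).
PATTERN = re.compile(r"([^\:]+)::([^\:]+)::([^\:]+)")


def split_record(line):
    """Return (movieid, title, genre), or None when the line does not match."""
    # Assumption: the entire line has to match, hence fullmatch().
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    movieid, title, genre = match.groups()
    return int(movieid), title, genre


# The sample record from the question.
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
print(split_record(sample))

# Hypothetical record whose title contains a single colon: each group is
# [^:]+, so this line is rejected rather than split into three fields.
print(split_record("7::A: B::Drama"))
```

One thing this makes visible: because every capture group excludes colons, a record whose title itself contains a colon would not match under this assumption, so it is worth checking the real dataset for such rows before relying on this pattern.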
