Apache Pig not parsing a tuple fully


Problem description

I have a file called data that looks like this (note that there are tabs after 'personA'):

personA (1, 2, 3)
personB (2, 1, 34)

And I have an Apache pig script like this:

A = LOAD 'data' AS (name: chararray, nodes: tuple(a:int, b:int, c:int));
C = foreach A generate nodes.$0;
dump C;

The output of which makes sense:

(1)
(2)

However if I change the schema of the script to be like this:

A = LOAD 'data' AS (name: chararray, nodes: tuple());
C = foreach A generate nodes.$0;
dump C;

Then the output I get is this:

(1, 2, 3)
(2, 1, 34)

It looks like the first (and only) element in this tuple is a bytearray, i.e. Pig is not parsing the input text 1, 2, 3 into a tuple.

In the future, my input will have an unknown and variable number of elements in the nodes item, so I can't just write out a:int, …

Is there any way to get Pig to parse the input tuple as a tuple without having to write out the full schema?

Solution

Pig does not accept what you are passing in as valid input. The default load function, PigStorage, only accepts delimited files (tab-delimited by default). It is not smart enough to parse the tuple construct with the parentheses and commas you have in the text. Your options are:

  • Reformat your file to be tab delimited: personA 1 2 3
  • Read the file in line by line with TextLoader, then write some sort of UDF that parses the line and returns the data in the form you want.
  • Write your own custom loader (see pig.apache.org/docs/r0.9.1/udf.html#load-store-functions).
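
For the second option, the parsing step itself might look something like the sketch below. It is plain Python rather than a complete Pig UDF, and the names parse_line and _LINE_RE are made up for illustration. It assumes every line is a name, optional whitespace, and then a parenthesised, comma-separated list of integers of any length.

```python
import re

# Hypothetical pattern: a name, optional whitespace, then "( ... )".
_LINE_RE = re.compile(r'^\s*(\S+)\s*\((.*)\)\s*$')


def parse_line(line):
    """Split one raw line into (name, nodes).

    nodes is a tuple of ints whose length is whatever the line contains.
    Returns None when the line does not match the assumed layout.
    Raises ValueError if an element inside the parentheses is not an integer.
    """
    match = _LINE_RE.match(line)
    if match is None:
        return None
    name, body = match.groups()
    # Skip empty pieces so that "()" gives an empty tuple.
    nodes = tuple(int(part) for part in body.split(',') if part.strip())
    return name, nodes


if __name__ == '__main__':
    for raw in ['personA\t(1,2,3)', 'personB\t(2,1,34)']:
        print(parse_line(raw))
```

Pig can register Python scripts as UDFs through Jython, so logic along these lines could in principle be applied to each line read by TextLoader. The registration and output-schema details are not covered by this sketch and would need checking against the UDF documentation for your Pig version.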
