pig-在特定行中为不存在的字段插入占位符时,将数据从行转换为列 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

查看:136
本文介绍了pig-在特定行中为不存在的字段插入占位符时,将数据从行转换为列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我在HDFS上有以下平面文件(我们称其为key_value):

Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

这是我想要的输出:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

换句话说,前两列构成一个唯一的标识符(类似于db术语中的复合键),并且对于此标识符的给定值,我们希望输出中有一行(即,后两列-是有效的键/值对-只要标识符相同,就会被压缩到同一行上).还要注意第二行中的空值,以添加唯一标识符为(2,1)时缺少的Supervisor件的占位符.

In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row to add placeholders for Supervisor piece that's missing when the unique identifier is (2, 1).

为此,我开始整理以下猪脚本:

Towards this end, I started putting together this pig script:

data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
   sorted = ORDER data BY key, value;
   GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;

上面的脚本为我提供了以下输出:

The above script gives me the following output:

(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)

请注意,缺少的Supervisor的空占位符不会出现在第二条记录中(这是预期的).如果我可以将这些null放置到位,那么摆脱多余的列(前两个被重复多次-每个键值对一次)似乎只是另一个计划.

Notice that the null place holders for missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of redundant columns (the first two which are replicated multiple times - once per every key value pair).

使用UDF的时间短,有没有办法使用内置函数在Pig中完成此操作?

Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?

更新:正如WinnieNicklaus正确指出的那样,输出中的名称是多余的.因此输出可以压缩为:

UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:

(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)

推荐答案

首先,让我指出,如果对于大多数行,大多数列未填写,则IMO更好的解决方案是使用地图.内置的TOMAP UDF与结合使用自定义UDF来组合地图你这样做.

First of all, let me point out that if for most rows, most of the columns are not filled out, that a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.

我确信有一种方法可以解决此问题,方法是计算所有可能的键的列表,用空值将其分解,然后扔掉还存在非空值的实例...但这会涉及许多MR周期,非常丑陋的代码,我怀疑这不比以其他方式组织数据更好.

I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect is no better than organizing your data in some other way.

您还可以编写一个UDF来容纳一袋键/值对,另一袋容纳所有可能的键,并生成您要查找的元组.这样会更清楚,更简单.

You could also write a UDF to take in a bag of key/value pairs, another bag all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

这篇关于pig-在特定行中为不存在的字段插入占位符时,将数据从行转换为列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆