猪 - 将数据从行转换为列,同时为特定行中不存在的字段插入占位符 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

查看:18
本文介绍了猪 - 将数据从行转换为列,同时为特定行中不存在的字段插入占位符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我在 HDFS 上有以下平面文件(我们称之为 key_value):

Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

这是我正在寻找的输出:

Here is the output I'm looking for:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

换句话说,前两列形成一个唯一标识符(类似于 db 术语中的复合键),对于该标识符的给定值,我们希望输出中有一行(即最后两列 - 其中是有效的键值对 - 只要标识符相同就会压缩到同一行).还要注意第二行中的空值,为当唯一标识符为 (2, 1) 时缺失的主管块添加占位符.

In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row to add placeholders for Supervisor piece that's missing when the unique identifier is (2, 1).

为此,我开始整理这个猪脚本:

Towards this end, I started putting together this pig script:

data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
   sorted = ORDER data BY key, value;
   GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;

上面的脚本给了我以下输出:

The above script gives me the following output:

(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)

请注意,缺少主管的空占位符未出现在第二条记录中(这是预期的).如果我可以将这些空值设置到位,那么摆脱冗余列似乎只是另一个投影问题(前两个重复多次 - 每个键值对一次).

Notice that the null place holders for missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of redundant columns (the first two which are replicated multiple times - once per every key value pair).

不使用 UDF,有没有办法使用内置函数在 pig 中完成此操作?

Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?

更新: 正如 WinnieNicklaus 正确指出的那样,输出中的名称是多余的.所以输出可以压缩为:

UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:

(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)

推荐答案

首先,让我指出,如果对于大多数行,大部分列都没有填写,那么 IMO 更好的解决方案是使用地图.内置 TOMAP UDF 与 自定义 UDF 组合地图 将使您能够做到这一点.

First of all, let me point out that if for most rows, most of the columns are not filled out, that a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.

我确信有一种方法可以通过计算所有可能键的列表来解决您的原始问题,用空值将其分解,然后丢弃也存在非空值的实例......但这会涉及很多 MR 周期,非常丑陋的代码,我怀疑这不比以其他方式组织数据更好.

I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect is no better than organizing your data in some other way.

您还可以编写一个 UDF 来接收一个键/值对包,另一个包所有可能的键,并生成您正在寻找的元组.这样会更清晰、更简单.

You could also write a UDF to take in a bag of key/value pairs, another bag all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

这篇关于猪 - 将数据从行转换为列,同时为特定行中不存在的字段插入占位符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆