猪 - 将数据从行转换为列，同时为特定行中不存在的字段插入占位符 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

查看：18 发布时间：2021/11/12 4:13:17 apache-pig placeholder transpose

本文介绍了猪 - 将数据从行转换为列，同时为特定行中不存在的字段插入占位符的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我在 HDFS 上有以下平面文件(我们称之为 key_value):

Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

这是我正在寻找的输出:

Here is the output I'm looking for:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

换句话说，前两列形成一个唯一标识符(类似于 db 术语中的复合键)，对于该标识符的给定值，我们希望输出中有一行(即最后两列 - 其中是有效的键值对 - 只要标识符相同就会压缩到同一行).还要注意第二行中的空值，为当唯一标识符为 (2, 1) 时缺失的主管块添加占位符.

In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row to add placeholders for Supervisor piece that's missing when the unique identifier is (2, 1).

为此，我开始整理这个猪脚本:

Towards this end, I started putting together this pig script:

data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
   sorted = ORDER data BY key, value;
   GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;

上面的脚本给了我以下输出:

The above script gives me the following output:

(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)

请注意，缺少主管的空占位符未出现在第二条记录中(这是预期的).如果我可以将这些空值设置到位，那么摆脱冗余列似乎只是另一个投影问题(前两个重复多次 - 每个键值对一次).

Notice that the null place holders for missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of redundant columns (the first two which are replicated multiple times - once per every key value pair).

不使用 UDF，有没有办法使用内置函数在 pig 中完成此操作?

Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?

更新: 正如 WinnieNicklaus 正确指出的那样，输出中的名称是多余的.所以输出可以压缩为:

UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:

(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)

猪 - 将数据从行转换为列，同时为特定行中不存在的字段插入占位符 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

猪 - 将数据从行转换为列，同时为特定行中不存在的字段插入占位符 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭