pig-在特定行中为不存在的字段插入占位符时，将数据从行转换为列 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

查看：136 发布时间：2020/5/28 1:46:35 apache-pig placeholder transpose

本文介绍了pig-在特定行中为不存在的字段插入占位符时，将数据从行转换为列的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我在HDFS上有以下平面文件(我们称其为key_value):

Suppose I have the following flat file on HDFS (let's call this key_value):

1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing

这是我想要的输出:

(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)

换句话说，前两列构成一个唯一的标识符(类似于db术语中的复合键)，并且对于此标识符的给定值，我们希望输出中有一行(即，后两列-是有效的键/值对-只要标识符相同，就会被压缩到同一行上).还要注意第二行中的空值，以添加唯一标识符为(2，1)时缺少的Supervisor件的占位符.

In other words, the first two columns form a unique identifier (similar to a composite key in db terminology) and for a given value of this identifier, we want one row in the output (i.e., the last two columns - which are effectively key-value pairs - are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row to add placeholders for Supervisor piece that's missing when the unique identifier is (2, 1).

为此，我开始整理以下猪脚本:

Towards this end, I started putting together this pig script:

data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
   sorted = ORDER data BY key, value;
   GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;

上面的脚本为我提供了以下输出:

The above script gives me the following output:

(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)

请注意，缺少的Supervisor的空占位符不会出现在第二条记录中(这是预期的).如果我可以将这些null放置到位，那么摆脱多余的列(前两个被重复多次-每个键值对一次)似乎只是另一个计划.

Notice that the null place holders for missing Supervisor are not represented in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of redundant columns (the first two which are replicated multiple times - once per every key value pair).

使用UDF的时间短，有没有办法使用内置函数在Pig中完成此操作?

Short of using a UDF, is there a way to accomplish this in pig using the in-built functions?

更新:正如WinnieNicklaus正确指出的那样，输出中的名称是多余的.因此输出可以压缩为:

UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:

(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)

pig-在特定行中为不存在的字段插入占位符时，将数据从行转换为列 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

pig-在特定行中为不存在的字段插入占位符时，将数据从行转换为列 [英] pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭