Load only particular field in PIG?

Problem description

This is my file:

Col1, Col2, Col3, Col4, Col5

I need only Col2 and Col3.

Currently I'm doing this:

a = load 'input' as (Col1:chararray, 
                     Col2:chararray, 
                     Col3:chararray, 
                     Col4:chararray);
b = foreach a generate Col2, Col3;

Is there a way to directly load only Col2 and Col3, instead of loading the whole input and then generating the required columns?

Solution

Your method of only GENERATEing the columns you want is an effective way to do just what you ask. Remember that all of your data is stored on HDFS, and you're not loading it all into memory when you start your script. You still will have to read those bytes off the disk even if you are not keeping them around for use in your processing, so there is no performance advantage to never loading that data. The advantage comes in never having to send it to a reducer, which you have accomplished with your method.

In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with your b = foreach a generate Col2, Col3;. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see if they get used. For example, suppose Col3 is an int. If you have

b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);

then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did

b = group a by Col2;
c = foreach b generate group, COUNT(a);

then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt about whether Pig will do this pruning, you can use the foreach/generate method you already have. Pig should also print a diagnostic message when you start your script, listing all the columns it was able to prune out.
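
For the COUNT case, one way to avoid depending on the optimizer is to project before grouping. This is only a sketch built on the schema from the question; the alias a_slim is a made-up name for illustration.

```pig
a = load 'input' as (Col1:chararray,
                     Col2:chararray,
                     Col3:chararray,
                     Col4:chararray);
-- Keep only the grouping column, so nothing else is shipped to the reducers.
a_slim = foreach a generate Col2;
b = group a_slim by Col2;
-- COUNT now sees bags that contain Col2 only.
c = foreach b generate group, COUNT(a_slim);
```

Note that COUNT skips tuples whose first field is null. After the projection the first field is Col2 rather than Col1, so the counts can differ from COUNT(a) if those columns contain nulls.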

If instead your problem is that you don't want to have to provide a full schema when you're interested in just a few columns, you can skip the schema entirely and put the types and names in the GENERATE:

a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
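
Positional references are zero-based, so $1 and $2 are the second and third fields. As a small variation on the lines above, assuming Col3 holds integers as in the SUM example, the cast can be changed accordingly and describe used to check the resulting schema:

```pig
a = load 'input';
-- $0 is Col1, so $1 and $2 pick out Col2 and Col3.
b = foreach a generate (chararray) $1 as Col2, (int) $2 as Col3;
-- Print the schema Pig has worked out for b.
describe b;
```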
