在猪中避免多关系联接中的前缀 [英] avoiding prefixes in multi relation join in pig

查看:45
本文介绍了在猪中避免多关系联接中的前缀的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在Pig中进行星型联接,下面是我的代码.当我用不同的列联接多个关系时,我每次都必须给前一个联接的名称加上前缀才能使其正常工作.我确信应该有更好的方法,但是我无法通过谷歌搜索找到它.任何指针都将非常有帮助.

I am trying to do a star schema type of join in pig and below is my code. When I join multiple relations with different columns, I have to prefix the name of the previous join every time to get it working. I am sure there should be some better way, I am not able to find it through googling. Any pointers will be very helpful.

我想避免像这样的列作为前缀"H864 :: H86 :: hs_8_d :: hs_8_desc".

i.e prefixing a column like this "H864::H86::hs_8_d::hs_8_desc" is what I want to avoid.

hs_8 = LOAD 'hs_8_distinct' USING PigStorage('^') as (hs_8:chararray,hs_8_desc:chararray);
hs_8_d = FOREACH hs_8 GENERATE SUBSTRING(hs_8,0,2) as hs_2,SUBSTRING(hs_8,0,4) as hs_4,SUBSTRING(hs_8,0,6) as hs_6,hs_8,hs_8_desc;

hs_6_d = LOAD 'hs_6_distinct' USING PigStorage('^') as (hs_6:chararray,hs_6_desc:chararray);
hs_4_d = LOAD 'hs_4_distinct' USING PigStorage('^') as (hs_4:chararray,hs_4_desc:chararray);
hs_2_d = LOAD 'hs_2_distinct' USING PigStorage('^') as (hs_2:chararray,hs_2_desc:chararray);

H86 = JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated' ;
H864 = JOIN H86 BY hs_8_d::hs_4, hs_4_d BY hs_4 USING 'replicated' ;
H8642 = JOIN H864 BY H86::hs_8_d::hs_2, hs_2_d BY hs_2 USING 'replicated' ;

hs_dim = FOREACH H8642 GENERATE hs_2_d::hs_2,hs_2_d::hs_2_desc,H864::hs_4_d::hs_4,H864::hs_4_d::hs_4_desc,H864::H86::hs_6_d::hs_6,H864::H86::hs_6_d::hs_6_desc,H864::H86::hs_8_d::hs_8,H864::H86::hs_8_d::hs_8_desc;

推荐答案

通过在联接中添加额外的foreach,可以稍微简化别名.检查统计信息,这不会在管道中添加额外的MR作业.原始文档将产生4个仅地图的作业.

By adding extra foreach to the joins you can slightly simplify the aliases. Check the statistics, this won't add extra MR jobs to the pipeline. The original and this will yield to 4 map-only jobs.

例如:

H86 = foreach (JOIN hs_8_d BY hs_6, hs_6_d BY hs_6 USING 'replicated') generate 
        hs_8_d::hs_2 as x1, 
        hs_8_d::hs_4 as x2, 
        hs_8_d::hs_6 as x3, 
        hs_8_d::hs_8 as x4,
        hs_8_d::hs_8_desc as x5, 
        hs_6_d::hs_6 as x6,
        hs_6_d::hs_6_desc as x7;

H864 = foreach (JOIN H86 BY x2, hs_4_d BY hs_4 USING 'replicated') generate 
          H86::x1 as y1,
          H86::x2 as y2, 
          H86::x3 as y3,
          H86::x4 as y4,
          H86::x5 as y5, 
          H86::x6 as y6, 
          H86::x7 as y7,
          hs_4_d::hs_4 as y8,
          hs_4_d::hs_4_desc as y9;

H8642 = foreach (JOIN H864 BY y1, hs_2_d BY hs_2 USING 'replicated') generate 
          H864::y1 as z1, 
          H864::y2 as z2,
          H864::y3 as z3, 
          H864::y4 as z4, 
          H864::y5 as z5, 
          H864::y6 as z6, 
          H864::y7 as z7,
          H864::y8 as z8, 
          H864::y9 as z9, 
          hs_2_d::hs_2 as z10, 
          hs_2_d::hs_2_desc as z11;

hs_dim = FOREACH H8642 GENERATE z10, z11, z8, z9, z6, z7, z4, z5;

如果您有一袋元组,则 Datafu

If you have a bag of tuples, then Datafu's AliasBagFields may be helpful.

这篇关于在猪中避免多关系联接中的前缀的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆