Pig Changing Schema to required type


Problem description

I'm a new Pig user.

I have an existing schema that I want to modify. My source data has 6 columns, as follows:

Name        Type    Date        Region    Op    Value
-----------------------------------------------------
john        ab      20130106    D         X     20
john        ab      20130106    D         C     19
john        ab      20130106    D         T     8
john        ab      20130106    E         C     854
john        ab      20130106    E         T     67
john        ab      20130106    E         X     98

and so on. Each Op value is always C, T or X.

I want to reshape my data into 7 columns, like this:

Name        Type    Date        Region    OpX    OpC   OpT
----------------------------------------------------------
john        ab      20130106    D         20     19    8
john        ab      20130106    E         98     854   67

In other words, I want to split the Op column into 3 columns, one per Op value. Each of these columns should contain the corresponding value from the Value column.

How can I do this in Pig?

Solution

One way to achieve the desired result:

IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray, 
       date:int, region:chararray, op:chararray, value:int);
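-- sort by op so that each group's values are expected in C, T, X order
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
  -- join the group's values into one string, then split it into a 3-field tuple
  bs = STRSPLIT(BagToString(A.value, ','),',',3);
  -- $0 = C, $1 = T, $2 = X
  generate flatten(group) as (name, type, date, region), 
    bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;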
}

describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX: 
    chararray,OpC: chararray,OpT: chararray}

dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
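
To make the data flow of the script above easier to follow, here is a minimal plain-Java sketch of the same steps (sort by op, group by the four key columns, pick values by position). It does not use Pig at all; the class name PivotSketch, the method pivot, and the String[] row layout are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Pig or of the original answer.
final class PivotSketch {

    // Each row is {name, type, date, region, op, value}.
    static List<String> pivot(List<String[]> rows) {
        // Step 1: sort by op (stable), so values reach each group in C, T, X order.
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] r) -> r[4]));

        // Step 2: group the values by (name, type, date, region).
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] r : sorted) {
            String key = r[0] + "," + r[1] + "," + r[2] + "," + r[3];
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r[5]);
        }

        // Step 3: positions are 0 = C, 1 = T, 2 = X; emit them as OpX, OpC, OpT.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            List<String> v = e.getValue();
            out.add("(" + e.getKey() + "," + v.get(2) + "," + v.get(0) + "," + v.get(1) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"john", "ab", "20130106", "D", "X", "20"},
            new String[] {"john", "ab", "20130106", "D", "C", "19"},
            new String[] {"john", "ab", "20130106", "D", "T", "8"},
            new String[] {"john", "ab", "20130106", "E", "C", "854"},
            new String[] {"john", "ab", "20130106", "E", "T", "67"},
            new String[] {"john", "ab", "20130106", "E", "X", "98"});
        // Prints:
        // (john,ab,20130106,D,20,19,8)
        // (john,ab,20130106,E,98,854,67)
        pivot(rows).forEach(System.out::println);
    }
}
```

Like the Pig script, this sketch relies on every group containing exactly one C, one T and one X row; a group with a missing op would shift the positions.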

Update:

If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op (bag v in the script). A custom UDF then sorts the resulting tuple fields and arranges them in the desired OpX, OpC, OpT order:

register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray, 
      date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
  v = foreach A generate CONCAT(op, (chararray)value);
  bs = STRSPLIT(BagToString(v, ','),',',3);
  generate flatten(group) as (name, type, date, region), 
    flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}

where TupleArrange in myjar.jar is something like this:

..
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] { 
                    (String) inputTuple.get(0),
                    (String) inputTuple.get(1), 
                    (String) inputTuple.get(2) 
            };  
            Arrays.sort(tupleArr); // ascending by op prefix: C..., T..., X...
            result.set(0, tupleArr[2].substring(1)); // OpX (op prefix stripped)
            result.set(1, tupleArr[0].substring(1)); // OpC
            result.set(2, tupleArr[1].substring(1)); // OpT
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}
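
The reordering logic of the UDF can be tried without a Pig installation. The sketch below is a plain-Java approximation: arrange mirrors the sort-and-pick-by-position approach on a String array, and arrangeByPrefix is an alternative that places each value by its op letter instead of by position, so a group with a missing op yields null for that column. The class and both method names are hypothetical and not part of the original answer.

```java
import java.util.Arrays;

// Hypothetical illustration class; mirrors the UDF's logic without Pig types.
final class ArrangeSketch {

    // Input such as {"X20", "C19", "T8"}; returns {OpX, OpC, OpT} with prefixes stripped.
    static String[] arrange(String[] prefixed) {
        String[] sorted = prefixed.clone();
        Arrays.sort(sorted); // ascending by op prefix: C..., T..., X...
        return new String[] {
            sorted[2].substring(1), // OpX
            sorted[0].substring(1), // OpC
            sorted[1].substring(1)  // OpT
        };
    }

    // Alternative (an assumption, not the author's method): route by op letter.
    // A missing op leaves null in its slot instead of shifting the other values.
    static String[] arrangeByPrefix(String[] prefixed) {
        String[] result = new String[3]; // {OpX, OpC, OpT}
        for (String s : prefixed) {
            String value = s.substring(1);
            switch (s.charAt(0)) {
                case 'X': result[0] = value; break;
                case 'C': result[1] = value; break;
                case 'T': result[2] = value; break;
                default: throw new IllegalArgumentException("Unexpected op in: " + s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [20, 19, 8]
        System.out.println(Arrays.toString(arrange(new String[] {"X20", "C19", "T8"})));
        // prints [null, 854, 67]
        System.out.println(Arrays.toString(arrangeByPrefix(new String[] {"T67", "C854"})));
    }
}
```

The positional version is the shorter of the two, but it assumes every group has exactly three entries, one per op.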
