Pig: Changing Schema to required type
I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name   Type   Date       Region   Op   Value
--------------------------------------------
john   ab     20130106   D        X    20
john   ab     20130106   D        C    19
john   ab     20130106   D        T    8
john   ab     20130106   E        C    854
john   ab     20130106   E        T    67
john   ab     20130106   E        X    98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name   Type   Date       Region   OpX   OpC   OpT
-------------------------------------------------
john   ab     20130106   D        20    19    8
john   ab     20130106   E        98    854   67
In other words, split the Op column into 3 columns, one for each Op value. Each of these columns should contain the appropriate value from the Value column.
How can I do this in Pig?
One way to achieve the desired result (this assumes every group contains exactly one row for each of C, T and X):
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
    date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
    bs = STRSPLIT(BagToString(A.value, ','), ',', 3);
    generate flatten(group) as (name, type, date, region),
        bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX: chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
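To make the target transformation concrete outside Pig, here is a minimal plain-Java sketch of the same pivot. It is not Pig code, and the class and method names (OpPivot, pivot) are hypothetical, chosen only for illustration. It assumes each input row is a six-element array in the order name, type, date, region, op, value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the pivot; hypothetical names, not part of Pig. */
final class OpPivot {
    /** Column order of the pivoted output: OpX, OpC, OpT. */
    private static final String OPS = "XCT";

    /**
     * Each input row is {name, type, date, region, op, value}.
     * Returns one row per (name, type, date, region) group: the four key
     * fields followed by the values for X, C and T (null when an op is absent).
     */
    static List<String[]> pivot(List<String[]> rows) {
        // LinkedHashMap keeps the groups in first-seen order.
        Map<String, String[]> groups = new LinkedHashMap<>();
        for (String[] r : rows) {
            String key = String.join("\u0001", r[0], r[1], r[2], r[3]);
            String[] out = groups.computeIfAbsent(
                key, k -> new String[] {r[0], r[1], r[2], r[3], null, null, null});
            // Only the single letters X, C and T are accepted as op values.
            int col = r[4].length() == 1 ? OPS.indexOf(r[4]) : -1;
            if (col < 0) {
                throw new IllegalArgumentException("Unexpected op: " + r[4]);
            }
            out[4 + col] = r[5];
        }
        return new ArrayList<>(groups.values());
    }
}
```

Because each value is placed by its op letter rather than by its position after sorting, this model does not depend on row order within a group.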
Update:
If you want to skip order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in bag v. Then sort the resulting tuple fields with a custom UDF to get the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
    date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
    v = foreach A generate CONCAT(op, (chararray)value);
    bs = STRSPLIT(BagToString(v, ','), ',', 3);
    generate flatten(group) as (name, type, date, region),
        flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
..
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            // The single argument is the 3-field tuple produced by STRSPLIT.
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            // Ascending sort on the op prefix gives the order C, T, X.
            Arrays.sort(tupleArr);
            // Strip the one-character op prefix and emit OpX, OpC, OpT.
            result.set(0, tupleArr[2].substring(1));
            result.set(1, tupleArr[0].substring(1));
            result.set(2, tupleArr[1].substring(1));
            return result;
        } catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Passes the input schema through; the script names the fields in its AS clause.
        return input;
    }
}
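The UDF above picks values by their position after sorting, so it relies on every group having exactly one C, one T and one X. If that cannot be guaranteed, one option is to key on the op prefix instead. The following is only a sketch of that arrangement logic in plain Java, without any Pig dependency; the OpArranger class and its arrange method are hypothetical names, and a missing op is assumed to map to null.

```java
/** Hypothetical helper, independent of Pig, showing a prefix-keyed arrangement. */
final class OpArranger {
    /**
     * Takes strings such as "X20", "C19", "T8" in any order and returns
     * {OpX, OpC, OpT}. A missing op leaves its slot null.
     */
    static String[] arrange(String[] prefixed) {
        String[] result = new String[3];
        for (String s : prefixed) {
            if (s == null || s.isEmpty()) {
                continue; // nothing to place
            }
            // The first character is the op; the rest is the value.
            switch (s.charAt(0)) {
                case 'X': result[0] = s.substring(1); break;
                case 'C': result[1] = s.substring(1); break;
                case 'T': result[2] = s.substring(1); break;
                default:
                    throw new IllegalArgumentException("Unexpected op prefix in: " + s);
            }
        }
        return result;
    }
}
```

For example, arrange(new String[] {"T8", "X20", "C19"}) yields {"20", "19", "8"}. The same loop could replace the sort-and-index logic inside exec if this behavior is wanted in the UDF.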