Pig casting / datatypes


Problem description


I'm trying to dump a relation into an AVRO file, but I'm getting a strange error:

org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence

I don't use DataByteArray (bytearray); see the description of the relation below.

sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray}

Even when I do explicit casting I get the same error:

sensitiveSet = foreach sensitiveSet generate (long) $0, (chararray) $1, (long) $2, (chararray) $3, (chararray) $4, (chararray) $5, (chararray) $6;

STORE sensitiveSet INTO 'testOut2222.avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', 'schema', '{"type":"record","name":"xxxx","namespace":"","fields":[{"name":"rank_ID","type":"long"},{"name":"name","type":"string","store":"no","sensitive":"na"},{"name":"customerId","type":"string","store":"yes","sensitive":"yes"},{"name":"VIN","type":"string","store":"yes","sensitive":"yes"},{"name":"birth_date","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_mileage","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_consumption","type":"string","store":"yes","sensitive":"no"}]}');

EDITED:

I'm trying to define an output schema which should be a tuple that contains another two tuples, i.e. stats:tuple(c:tuple(),d:tuple()).

The code below doesn't work as intended. It somehow produces a structure like this:

stats:tuple(b:tuple(c:tuple(),d:tuple()))

Below is output produced by describe.

sourceData: {com.mortardata.pig.dataspliter_36: (stats: ((name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),(name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray)))}

Is it possible to create the structure below? That would mean removing tuple b from the previous example.

grunt> describe sourceData;
sourceData: {t: (s: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),n: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray))}

The code below doesn't work as expected.

  public Schema outputSchema(Schema input) {
    Schema sensTuple = new Schema();
    sensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
    sensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));

    Schema nonSensTuple = new Schema();
    nonSensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
    nonSensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));


    Schema parentTuple = new Schema();
    parentTuple.add(new Schema.FieldSchema(null, sensTuple, DataType.TUPLE));
    parentTuple.add(new Schema.FieldSchema(null, nonSensTuple, DataType.TUPLE));


    Schema outputSchema = new Schema();
    outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));

    return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), outputSchema, DataType.TUPLE));
  }

The UDF's exec method returns:

public Tuple exec(Tuple tuple) throws IOException {
  Tuple parentTuple = mTupleFactory.newTuple();
  // ... tuple1 and tuple2 are built from the input tuple (elided in the question)
  parentTuple.append(tuple1);
  parentTuple.append(tuple2);
  return parentTuple;
}
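The fix reported below under SOLVED is to cast each field to the proper type inside exec() before appending it. A minimal, self-contained sketch of that coercion, using a plain byte[] as a stand-in for the bytes a DataByteArray carries (the class name ExecCastSketch and helper name asChararray are mine, not from the question):

```java
import java.nio.charset.StandardCharsets;

public class ExecCastSketch {
    // Coerce a raw field to a String the way a (chararray) cast would,
    // so that an Avro "string" field receives a CharSequence rather
    // than raw bytes.
    static String asChararray(Object field) {
        if (field == null) {
            return null;
        }
        if (field instanceof byte[]) {  // stand-in for DataByteArray
            return new String((byte[]) field, StandardCharsets.UTF_8);
        }
        return field.toString();
    }

    public static void main(String[] args) {
        Object raw = "some_chararray_value".getBytes(StandardCharsets.UTF_8);
        // Decodes the bytes instead of triggering a ClassCastException downstream.
        System.out.println(asChararray(raw));
    }
}
```

In the real UDF the same decoding would be applied to each field pulled out of the input tuple before it is appended to tuple1 or tuple2.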

EDIT2 (FIXED)

...
Schema outputSchema = new Schema();
outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));

// removed -- this wrapper added the unwanted extra outer tuple:
// return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), outputSchema, DataType.TUPLE));

return outputSchema;

Now I return the proper schema from the UDF, where all items are chararray, but when I try to store those items into the Avro file as type: string I get the same error:

java.lang.Exception: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)

SOLVED: OK, the issue was that the data wasn't cast to the proper type inside the UDF body (the exec() method). It works now!

Solution

Usually this means you are using a UDF that isn't preserving the schema, or the schema is getting lost somewhere. I believe DataByteArray is the fallback type Pig uses when the real type isn't known. You may need to add a cast to work around this, but a better solution is to fix whichever UDF is dropping the schema.
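To see why the error takes this shape: when the schema is lost, a field value arrives as raw bytes rather than a String, and the direct cast that the Avro writer needs for a "string" field fails at runtime. A small stdlib-only sketch, with byte[] standing in for DataByteArray (the class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class FallbackCastDemo {
    // Returns true if the value can be cast to CharSequence, which is what
    // writing an Avro "string" field requires; false when the cast would
    // throw a ClassCastException like the one in the question.
    static boolean castableToCharSequence(Object field) {
        return field instanceof CharSequence;
    }

    public static void main(String[] args) {
        Object asBytes = "birth_date".getBytes(StandardCharsets.UTF_8);
        Object asString = "birth_date";
        System.out.println(castableToCharSequence(asBytes));   // false: schema was lost
        System.out.println(castableToCharSequence(asString));  // true: schema preserved
    }
}
```

This is why a cast (or a schema-preserving UDF) fixes the error: it ensures the value handed to the storer is already a String, not a byte wrapper.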
