Optional array in Avro schema


Question

I'm wondering whether it is possible to have an optional array in an Avro schema. Assume a schema like this:

{
    "type": "record",
    "name": "test_avro",
    "fields": [
        {"name": "test_field_1", "type": "long"},
        {"name": "subrecord", "type": [{
            "type": "record",
            "name": "subrecord_type",
            "fields": [{"name": "field_1", "type": "long"}]
        }, "null"]},
        {"name": "simple_array", "type": {
            "type": "array",
            "items": "string"
        }}
    ]
}

Trying to write an Avro record without "simple_array" results in an NPE in the DataFileWriter. For the subrecord this works fine, but when I try to define the array as optional in the same way:

{"name": "simple_array",
 "type": [{
   "type": "array",
   "items": "string"
 }, "null"]
}

it no longer results in an NPE, but in a runtime exception instead:

AvroRuntimeException: Not an array schema: [{"type":"array","items":"string"},"null"]

Thanks.

Recommended answer

I think what you want here is a union of null and array, with "null" listed first and a default of null:

{
    "type":"record",
    "name":"test_avro",
    "fields":[{
            "name":"test_field_1",
            "type":"long"
        },
        {
            "name":"subrecord",
            "type":[{
                    "type":"record",
                    "name":"subrecord_type",
                    "fields":[{
                            "name":"field_1",
                            "type":"long"
                        }
                    ]
                },
                "null"
            ]
        },
        {
            "name":"simple_array",
            "type":["null",
                {
                    "type":"array",
                    "items":"string"
                }
            ],
            "default":null
        }
    ]
}
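
The difference from the schema in the question is the order of the union branches and the explicit default. As far as I know, the Avro specification has traditionally required a union field's default to match the first branch of the union, which is why "null" is conventionally listed first; treat that as background rather than something the answer states. The sketch below uses only the standard `json` module (not the Avro library) to inspect a field definition; the helper names are hypothetical. It also shows how to pick the array branch out of the union, on the assumption (the question does not show the calling code) that the "Not an array schema" error came from handing the whole union to code that expects an array schema.

```python
import json

# Field definition copied from the answer's schema.
FIELD_JSON = """
{
    "name": "simple_array",
    "type": ["null", {"type": "array", "items": "string"}],
    "default": null
}
"""


def is_nullable_union(field_type):
    """Return True if the type is a union (JSON list) containing "null"."""
    return isinstance(field_type, list) and "null" in field_type


def array_branch(field_type):
    """Return the array schema from a plain array type or a union, else None."""
    branches = field_type if isinstance(field_type, list) else [field_type]
    for branch in branches:
        if isinstance(branch, dict) and branch.get("type") == "array":
            return branch
    return None


def null_default_is_consistent(field):
    """Check a null default against the first union branch (assumed rule)."""
    if "default" not in field or field["default"] is not None:
        return True  # no null default to check
    field_type = field["type"]
    if isinstance(field_type, list):
        return field_type[0] == "null"
    return field_type == "null"


field = json.loads(FIELD_JSON)
print(is_nullable_union(field["type"]))
print(array_branch(field["type"]))
print(null_default_is_consistent(field))
```

Running the same checks against the question's variant, where the array is listed first, makes `null_default_is_consistent` return False once a null default is added, which is the case the reordering avoids.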

When I use the above schema with sample data that omits both optional fields in Python (a Python 2 session; schema_string is the JSON string above), here's the result:

>>> from avro import io, datafile, schema
>>> from json import dumps
>>> 
>>> sample_data = {'test_field_1':12L}
>>> rec_schema = schema.parse(schema_string)
>>> rec_writer = io.DatumWriter(rec_schema)
>>> rec_reader = io.DatumReader()
>>> 
>>> # write avro file
... df_writer = datafile.DataFileWriter(open("/tmp/foo", 'wb'), rec_writer, writers_schema=rec_schema)
>>> df_writer.append(sample_data)
>>> df_writer.close()
>>> 
>>> # read avro file
... df_reader = datafile.DataFileReader(open('/tmp/foo', 'rb'), rec_reader)
>>> print dumps(df_reader.next())
{"simple_array": null, "test_field_1": 12, "subrecord": null}
