Avro Schema to Spark StructType

Problem description

This is effectively the same as my previous question, but using Avro rather than JSON as the data format.

I'm working with a Spark dataframe which could be loading data from one of a few different schema versions:

// Version One
{"namespace": "com.example.avro",
 "type": "record",
 "name": "MeObject",
 "fields": [
     {"name": "A", "type": ["null", "int"], "default": null}
 ]
}

// Version Two
{"namespace": "com.example.avro",
 "type": "record",
 "name": "MeObject",
 "fields": [
     {"name": "A", "type": ["null", "int"], "default": null},
     {"name": "B", "type": ["null", "int"], "default": null}
 ]
}

I'm using spark-avro to load the data:

DataFrame df = context.read()
  .format("com.databricks.spark.avro")
  .load("path/to/avro/file");

which may be a Version One file or a Version Two file. However, I'd like to be able to process it in an identical manner, with the unknown values set to "null". The recommendation in my previous question was to set the schema; however, I do not want to repeat myself by writing the schema in both an .avro file and as Spark's StructType and friends. How can I convert the Avro schema (either the text file or the generated MeObject.getClassSchema()) into Spark's StructType?
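
For concreteness, a Version Two file should come back with a schema along the lines of the StructType below (written out by hand in Scala purely as an illustration; maintaining this by hand is exactly the duplication I want to avoid):

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hand-written equivalent of the Version Two Avro schema: both fields are
// the union ["null", "int"], so they should map to nullable integer columns.
val expectedSchema = StructType(Seq(
  StructField("A", IntegerType, nullable = true),
  StructField("B", IntegerType, nullable = true)
))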

Spark Avro has a SchemaConverters (https://github.com/databricks/spark-avro/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala), but it is all private and returns some strange internal object.
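
What I would like is, roughly, the call below (hypothetical as written, since the converter is not publicly accessible in the spark-avro release I'm using):

import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Hypothetical usage: SchemaConverters is not public in my version,
// so this does not compile from user code.
val avroSchema = MeObject.getClassSchema()
val sparkSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]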

Answer

Disclaimer: It's kind of a dirty hack. It depends on a few things:


  • Python provides a lightweight Avro processing library and due to its dynamism it doesn't require typed writers
  • an empty Avro file is still a valid document
  • Spark schema can be converted to and from JSON (see the sketch right after this list)
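
To illustrate that last point, a Spark schema survives a round trip through its JSON form (a minimal Scala sketch, independent of the Avro part):

import org.apache.spark.sql.types.{DataType, IntegerType, StructField, StructType}

// A StructType serialises to a JSON string and can be rebuilt from it.
val original = StructType(Seq(StructField("A", IntegerType, nullable = true)))
val asJson: String = original.json
val restored = DataType.fromJson(asJson).asInstanceOf[StructType]
// restored is structurally equal to original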

The following code reads an Avro schema file, creates an empty Avro file with the given schema, reads it back with spark-avro, and writes the resulting Spark schema out as a JSON file.

import argparse
import tempfile

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

from pyspark import SparkContext
from pyspark.sql import SQLContext

def parse_schema(schema):
    # Parse the .avsc file into an avro.schema.Schema object.
    with open(schema) as fr:
        return avro.schema.parse(fr.read())

def write_dummy(schema):
    # Create an empty Avro file that carries only the schema, no records.
    tmp = tempfile.mktemp(suffix='.avro')
    with open(tmp, "w") as fw:
        writer = DataFileWriter(fw, DatumWriter(), schema)
        writer.close()
    return tmp

def write_spark_schema(path, schema):
    # Dump the Spark StructType as JSON so it can be reloaded elsewhere.
    with open(path, 'w') as fw:
        fw.write(schema.json())


def main():
    parser = argparse.ArgumentParser(description='Avro schema converter')
    parser.add_argument('--schema')
    parser.add_argument('--output')
    args = parser.parse_args()

    sc = SparkContext('local[1]', 'Avro schema converter')
    sqlContext = SQLContext(sc)

    df = (sqlContext.read.format('com.databricks.spark.avro')
            .load(write_dummy(parse_schema(args.schema))))

    write_spark_schema(args.output, df.schema)
    sc.stop()


if __name__ == '__main__':
    main()

Usage:

bin/spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 \
   avro_to_spark_schema.py \
   --schema path_to_avro_schema.avsc \
   --output path_to_spark_schema.json

Reading the schema:

import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

val json: String = Source.fromFile("schema.json").getLines.toList.head
val schema: StructType = DataType.fromJson(json).asInstanceOf[StructType]
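
With the StructType recovered, it can be passed back to the Avro reader so that Version One and Version Two files produce the same columns, with missing fields left as null (a sketch, assuming a SQLContext named sqlContext is already in scope and that the reader honours a user-supplied schema the same way the JSON source does):

// Apply the recovered schema when loading the Avro data.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .schema(schema)
  .load("path/to/avro/file")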
