PySpark, importing schema through JSON file

Problem description

tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using the following code:

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
root
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)

  1. Why do the schema elements get sorted, when I want them in the same order as they appear in the JSON?

  2. The data type integer has been converted into StringType after the schema was derived from the JSON. How do I retain the data type?

Recommended answer

Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

Because the order of fields is not guaranteed. While this is not explicitly stated, it becomes obvious when you take a look at the examples provided in the JSON reader docstring. If you need a specific ordering, you can provide the schema manually:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("TICKET", StringType(), True),
    StructField("TRANFERRED", StringType(), True),
    StructField("ACCOUNT", StringType(), True),
])
df2 = sqlContext.read.json("tbschema.json", schema)
df2.printSchema()

root
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
 |-- ACCOUNT: string (nullable = true)

The data type integer has been converted into StringType after the schema was derived from the JSON. How do I retain the data type?

The data type of the JSON field TICKET is string, hence the JSON reader returns a string. It is a JSON reader, not some-kind-of-schema reader: it infers types from the values in the file, and the value stored under TICKET is the string "integer".
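To make that concrete, here is a minimal sketch that parses the same content with Python's standard json module instead of Spark. The file content is inlined as a string for illustration; the variable names are made up for this example.

```python
import json

# The content of tbschema.json from the question, inlined for illustration.
raw = '[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]'

record = json.loads(raw)[0]

# Record the Python type of each value. Every value is a JSON string,
# including the one that happens to spell "integer".
value_types = {name: type(value).__name__ for name, value in record.items()}
print(value_types)
```

Any reader that infers types from values will see three strings here, which is why all three columns come back as StringType.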

Generally speaking, you should consider a proper format that comes with schema support out of the box, for example Parquet, Avro or Protocol Buffers. But if you really want to play with JSON, you can define a poor man's "schema" parser like this:

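import json
from collections import OrderedDict

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Read the raw schema description.
with open("./tbschema.json") as fr:
    ds = fr.read()

# object_pairs_hook=OrderedDict keeps the fields in the order they appear in the file.
items = (json
  .JSONDecoder(object_pairs_hook=OrderedDict)
  .decode(ds)[0].items())

# Map type names used in the file to Spark types.
# ... add further types here as needed.
mapping = {"string": StringType, "integer": IntegerType}

# Lowercase the type name so that "STRING" and "string" are treated alike;
# an unknown type name raises a KeyError.
schema = StructType([
    StructField(k, mapping[v.lower()](), True) for (k, v) in items])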

The problem with JSON is that there is really no guarantee regarding field ordering whatsoever, not to mention handling of missing fields, inconsistent types and so on. So using a solution like the one above really depends on how much you trust your data.
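If the schema file is not fully trusted, one option is to validate it before building any Spark types. The following is only a sketch using the standard library: the function name parse_schema_spec, the KNOWN_TYPES set and the error messages are hypothetical and not part of PySpark.

```python
import json
from collections import OrderedDict

# Type names this sketch accepts; an assumption, extend as needed.
KNOWN_TYPES = {"string", "integer"}


def parse_schema_spec(text):
    """Return ordered (field, type_name) pairs, failing loudly on bad input."""
    records = json.loads(text, object_pairs_hook=OrderedDict)
    if not isinstance(records, list) or len(records) != 1:
        raise ValueError("expected a JSON array holding exactly one object")
    if not isinstance(records[0], dict):
        raise ValueError("expected the array element to be a JSON object")
    pairs = []
    for name, type_name in records[0].items():
        # Reject non-string values and type names missing from KNOWN_TYPES.
        if not isinstance(type_name, str) or type_name.lower() not in KNOWN_TYPES:
            raise ValueError("unsupported type for field %r: %r" % (name, type_name))
        pairs.append((name, type_name.lower()))
    return pairs


spec = '[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]'
pairs = parse_schema_spec(spec)
print(pairs)
```

The resulting pairs keep the file order and use normalized type names, so they can be fed to a name-to-type mapping when constructing the StructType.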

Alternatively, you can use the built-in schema import/export utilities.
