PySpark, importing schema through JSON file


Problem Description

tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using the following code:

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
root
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)

  1. Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

  2. The data type integer has been converted into StringType after loading the JSON; how do I retain the data type?

Solution

Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

Because the order of fields is not guaranteed. While it is not explicitly stated, it becomes obvious when you take a look at the examples provided in the JSON reader docstring. If you need a specific ordering, you can provide the schema manually:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("TICKET", StringType(), True),
    StructField("TRANFERRED", StringType(), True),
    StructField("ACCOUNT", StringType(), True),
])
df2 = sqlContext.read.json("tbschema.json", schema)
df2.printSchema()

root
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
 |-- ACCOUNT: string (nullable = true)

The data type integer has been converted into StringType after loading the JSON; how do I retain the data type?

The data type of the JSON field TICKET is string, hence the JSON reader returns a string. It is a JSON reader, not some kind of schema reader.

Generally speaking, you should consider some proper format which comes with schema support out of the box, for example Parquet, Avro, or Protocol Buffers. But if you really want to play with JSON, you can define a poor man's "schema" parser like this:

from collections import OrderedDict
import json

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType)

with open("./tbschema.json") as fr:
    ds = fr.read()

# Decode with OrderedDict so fields keep the order they have in the file
items = (json
  .JSONDecoder(object_pairs_hook=OrderedDict)
  .decode(ds)[0].items())

# Map the type names used in the file to Spark SQL types;
# extend this with further entries as needed
mapping = {"string": StringType, "integer": IntegerType}

schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])
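
For completeness, a minimal sketch of how the parsed schema might then be applied when reading the actual data; the file name data.json is illustrative, not from the original question:

# Hypothetical data file; any JSON file matching the schema works
df = sqlContext.read.json("data.json", schema)
df.printSchema()  # fields come back in file order, with TICKET as integer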

The problem with JSON is that there is really no guarantee regarding field ordering whatsoever, not to mention the handling of missing fields, inconsistent types, and so on. So using a solution like the one above really depends on how much you trust your data.

Alternatively, you can use the built-in schema import/export utilities.
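
A rough sketch of what that could look like, assuming a Spark version where StructType exposes json() and fromJson(); the file name schema-def.json is illustrative:

import json
from pyspark.sql.types import StructType

# Export: serialize an existing DataFrame schema to its JSON representation
with open("schema-def.json", "w") as fw:
    fw.write(df2.schema.json())

# Import: rebuild the StructType from the stored definition
with open("schema-def.json") as fr:
    restored = StructType.fromJson(json.load(fr))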
