Create dataframe with schema provided as JSON file

Views: 81 Published: 2021/4/8 20:30:08 Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes


Question

How can I create a PySpark DataFrame from two JSON files?

File 1: contains the complete data.
File 2: contains only the schema of the data in file 1.

File 1

{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

File 2

[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},"name":"DESCR","nullable":true,"type":"string"},{"metadata":{},"name":"DESCRSHORT","nullable":true,"type":"string"}],"type":"struct"}]

Recommended answer

First read the schema file with Python's json.load, then convert it to a StructType with StructType.fromJson. The schema in file 2 is wrapped in a JSON array, so it is the first element of the parsed list that gets converted:
import json
from pyspark.sql.types import StructType

with open("/path/to/file2.json") as f:
    json_schema = json.load(f)

schema = StructType.fromJson(json_schema[0])
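
The `[0]` index above is needed only because file 2 wraps the struct in a one-element JSON array. As a minimal sketch, a small helper could accept either that wrapped form or a bare struct object and fail early on anything else. The name `load_schema_dict` is hypothetical and not part of PySpark; it uses only the standard library.

```python
import json


def load_schema_dict(path):
    """Return the struct-schema dict stored in a JSON schema file.

    Hypothetical helper: accepts either a bare {"type": "struct", ...}
    object or a one-element array wrapping it (the shape of file 2).
    """
    with open(path, encoding="utf-8") as f:
        parsed = json.load(f)

    # Unwrap the one-element array form used by file 2 in the question.
    if isinstance(parsed, list):
        if len(parsed) != 1:
            raise ValueError(f"expected exactly one schema, found {len(parsed)}")
        parsed = parsed[0]

    # StructType.fromJson expects a dict describing a struct.
    if not isinstance(parsed, dict) or parsed.get("type") != "struct":
        raise ValueError("schema JSON must be an object with type 'struct'")
    return parsed
```

The result can then be passed straight to the conversion step, for example `StructType.fromJson(load_schema_dict("/path/to/file2.json"))`.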

Now pass that schema to the DataFrameReader:
df = spark.read.schema(schema).json("/path/to/file1.json")

df.show()

#+---------+----------+----------+-------------------+----------+
#|RESIDENCY|     EFFDT|EFF_STATUS|              DESCR|DESCRSHORT|
#+---------+----------+----------+-------------------+----------+
#|      AUS|01-01-1900|         A|Australian Resident|Australian|
#+---------+----------+----------+-------------------+----------+
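
When an explicit schema is supplied, Spark's JSON reader (in its default permissive mode, to the best of my knowledge) fills fields that a record lacks with null and ignores fields that the schema does not list, so a mismatch between the two files may go unnoticed. The sketch below is an optional pre-check in plain Python that compares one record's keys with the schema's field names. `compare_fields` is a hypothetical helper, and the schema dict is rebuilt here from the field names shown in file 2.

```python
import json


def compare_fields(record_line, schema_dict):
    """Return (missing, extra) field names of one JSON record vs. the schema."""
    record_keys = set(json.loads(record_line))
    schema_names = {field["name"] for field in schema_dict["fields"]}
    missing = sorted(schema_names - record_keys)  # in schema, not in record
    extra = sorted(record_keys - schema_names)    # in record, not in schema
    return missing, extra


# Same structure as the struct inside file 2.
schema_dict = {
    "type": "struct",
    "fields": [
        {"metadata": {}, "name": name, "nullable": True, "type": "string"}
        for name in ("RESIDENCY", "EFFDT", "EFF_STATUS", "DESCR", "DESCRSHORT")
    ],
}

# The single record from file 1.
record = (
    '{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A",'
    '"DESCR":"Australian Resident","DESCRSHORT":"Australian"}'
)

print(compare_fields(record, schema_dict))  # ([], [])
```

Two empty lists mean the record and the schema name exactly the same fields.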

  
If the file containing the schema is located in GCS, you can use the Spark or Hadoop API to get the file content. Here is an example using Spark:
file_content = spark.read.text("/path/to/file2.json").rdd.map(
    lambda r: " ".join([str(elt) for elt in r])
).reduce(
    lambda x, y: "\n".join([x, y])
)

json_schema = json.loads(file_content)
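
A possibly simpler variant of the same idea, assuming a Spark version whose `DataFrameReader.text` accepts the `wholetext` option: with `wholetext=True` each file is returned as a single row, so the map and reduce step that reassembles lines is not needed. The function name `read_schema_json` is hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
import json


def read_schema_json(spark, path):
    """Read a small JSON schema file through Spark and parse it.

    `spark` is an existing SparkSession; `path` is any location the
    cluster can read.
    """
    # wholetext=True yields one row holding the entire file content.
    row = spark.read.text(path, wholetext=True).first()
    # first() returns None when the DataFrame has no rows.
    if row is None:
        raise ValueError(f"no content read from {path}")
    return json.loads(row[0])
```

As with the local file, the parsed value for file 2 is a one-element list, so its first element is what `StructType.fromJson` would receive.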

