PySpark: Map a SchemaRDD into a SchemaRDD

Question

I am loading a file of JSON objects as a PySpark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert them into a Hive table.

The problem I have is that the following returns a PipelinedRDD, not a SchemaRDD:

log_json.map(flatten_function)

(where log_json is a SchemaRDD).

Is there a way to either preserve the type, cast back to the desired type, or efficiently insert from the new type?
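For context, a minimal sketch of this setup, assuming the Spark 1.x SchemaRDD API; the file path, field names, and flatten_function here are hypothetical:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# jsonFile infers a schema from the JSON objects and returns a SchemaRDD
log_json = hive_context.jsonFile("logs.json")

def flatten_function(record):
    # Hypothetical flattening: promote nested fields to top-level values
    return (record.user.name, record.event, record.ts)

# map() goes through the plain RDD API, so the schema is lost:
# the result is a PipelinedRDD, not a SchemaRDD
flattened = log_json.map(flatten_function)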

Answer

The solution is applySchema:

mapped = log_json.map(flatten_function)
hive_context.applySchema(mapped, flat_schema).insertInto(name)

Where flat_schema is a StructType representing the schema in the same way as you would obtain it from log_json.schema() (but flattened, obviously).
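For example, a slightly fuller sketch of the above, assuming the Spark 1.x API (where the type classes are importable directly from pyspark.sql); the field names, flatten_function, and table name are hypothetical:

from pyspark.sql import StructType, StructField, StringType

# hive_context and log_json as in the question:
# a HiveContext and a SchemaRDD loaded via jsonFile

# Flattened schema, built the same way log_json.schema() would describe it,
# but with the nested fields promoted to top-level columns
flat_schema = StructType([
    StructField("user", StringType(), True),
    StructField("event", StringType(), True),
    StructField("ts", StringType(), True),
])

def flatten_function(record):
    # Return values in the same order as the fields in flat_schema
    return (record.user.name, record.event, record.ts)

mapped = log_json.map(flatten_function)

# applySchema turns the plain RDD back into a SchemaRDD, which can then be
# inserted into an existing Hive table
hive_context.applySchema(mapped, flat_schema).insertInto("flat_logs")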
