PySpark: Map a SchemaRDD into a SchemaRDD
Question
I am loading a file of JSON objects as a PySpark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert into a Hive table.
The problem I have is that the following returns a PipelinedRDD, not a SchemaRDD:
log_json.map(flatten_function)
(where log_json is a SchemaRDD).
Is there either a way to preserve type, cast back to the desired type, or efficiently insert from the new type?
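For context, a flatten_function along these lines is what the question has in mind. This is a hypothetical sketch in plain Python, not the asker's actual code; the underscore-joined naming scheme is an assumption:

```python
# Hypothetical flatten_function: collapses nested dicts into a single
# level, joining key paths with underscores (the separator is an assumption).
def flatten_function(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the accumulated prefix.
            flat.update(flatten_function(value, name + "_"))
        else:
            flat[name] = value
    return flat

row = {"user": {"id": 1, "geo": {"city": "Oslo"}}, "event": "click"}
print(flatten_function(row))
# {'user_id': 1, 'user_geo_city': 'Oslo', 'event': 'click'}
```

Mapping such a function over a SchemaRDD produces plain Python dicts, which is exactly why the schema information is lost and the result comes back as a PipelinedRDD.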
Answer
The solution is applySchema:
mapped = log_json.map(flatten_function)
hive_context.applySchema(mapped, flat_schema).insertInto(name)
Where flat_schema is a StructType representing the schema in the same way as you would obtain it from log_json.schema() (but flattened, obviously).
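The flat_schema can be derived mechanically from the nested schema: walk the fields, and wherever a field's type is itself a struct, recurse and prefix the child names. The sketch below shows that walk in plain Python, with (name, type) tuples and nested lists standing in for pyspark.sql's StructField and StructType; in real code you would rebuild a StructType(StructField(...)) from the resulting list. All names here are illustrative assumptions:

```python
# Pure-Python sketch of schema flattening. A nested struct type is modelled
# as a list of (name, dtype) tuples; a dtype that is itself a list represents
# a nested struct (stand-ins for pyspark.sql StructType/StructField).
def flatten_schema(fields, prefix=""):
    flat = []
    for name, dtype in fields:
        full = prefix + name
        if isinstance(dtype, list):
            # Nested struct: recurse, prefixing child field names,
            # mirroring the key naming used by the flatten step.
            flat.extend(flatten_schema(dtype, full + "_"))
        else:
            flat.append((full, dtype))
    return flat

nested = [("user", [("id", "long"), ("geo", [("city", "string")])]),
          ("event", "string")]
print(flatten_schema(nested))
# [('user_id', 'long'), ('user_geo_city', 'string'), ('event', 'string')]
```

The key point is that the field-name prefixing here must match whatever naming flatten_function uses on the data rows, so that applySchema can line the flattened records up with flat_schema.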