Pyspark: Map a SchemaRDD into a SchemaRDD
Question
I am loading a file of JSON objects as a PySpark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert them into a Hive table.
The problem I have is that the following returns a PipelinedRDD, not a SchemaRDD:
log_json.map(flatten_function)
(Here, log_json is a SchemaRDD.)
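The question does not show `flatten_function` itself. A minimal sketch of what such a function might look like, treating each JSON record as a plain Python dict and joining nested keys with underscores (the name and the key-joining convention are assumptions, not part of the original question):

```python
def flatten_function(record, prefix=""):
    """Recursively flatten nested dicts into a single-level dict.

    Hypothetical sketch of the flatten_function referenced above;
    the real one depends on the actual JSON layout.
    """
    flat = {}
    for key, value in record.items():
        full_key = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing child keys
            flat.update(flatten_function(value, full_key + "_"))
        else:
            flat[full_key] = value
    return flat
```

With this sketch, a record like `{"a": 1, "b": {"c": 2}}` would map to `{"a": 1, "b_c": 2}`, so every output row has the same flat set of columns.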
Is there either a way to preserve type, cast back to the desired type, or efficiently insert from the new type?
Answer
The solution is applySchema:
mapped = log_json.map(flatten_function)
hive_context.applySchema(mapped, flat_schema).insertInto(name)
where flat_schema is a StructType representing the schema in the same way as you would obtain it from log_json.schema() (but flattened, obviously).