Pyspark: Map a SchemaRDD into a SchemaRDD

Problem description

I am loading a file of JSON objects as a pyspark SchemaRDD. I want to change the "shape" of the objects (basically, I'm flattening them) and then insert into a Hive table.

The problem I have is that the following returns a PipelinedRDD not a SchemaRDD:

log_json.map(flatten_function)

(Here, log_json is a SchemaRDD.)

Is there either a way to preserve type, cast back to the desired type, or efficiently insert from the new type?

Recommended answer

The solution is applySchema:

mapped = log_json.map(flatten_function)
hive_context.applySchema(mapped, flat_schema).insertInto(name)

Where flat_schema is a StructType representing the schema in the same way as you would obtain from log_json.schema() (but flattened, obviously).
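
The part that is easy to get wrong is keeping the output of flatten_function lined up with the fields of flat_schema. The following is only a minimal sketch of that idea in plain Python, not the original poster's code. The record layout, the FLAT_FIELDS list, and the get_path helper are all hypothetical, and the records are assumed to be nested dicts (as json.loads would produce). Rows coming out of a real SchemaRDD are Row objects, so the lookup would need to be adapted to attribute access.

```python
# Hypothetical nested record, shaped like one parsed JSON log line.
record = {
    "ts": "2014-10-01T12:00:00",
    "user": {"id": 7, "name": "ann"},
    "req": {"path": "/home"},
}

# Hypothetical flat layout: (flat column name, path into the nested record).
# flat_schema would need one StructField per entry, in this same order.
FLAT_FIELDS = [
    ("ts", ("ts",)),
    ("user_id", ("user", "id")),
    ("user_name", ("user", "name")),
    ("req_path", ("req", "path")),
]


def get_path(nested, path):
    """Walk a nested dict along `path`; return None if any key is missing."""
    value = nested
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def flatten_function(nested):
    """Return one flat tuple whose positions match FLAT_FIELDS."""
    return tuple(get_path(nested, path) for _, path in FLAT_FIELDS)


# Column names in the order flat_schema would have to declare them.
flat_names = [name for name, _ in FLAT_FIELDS]
flat_row = flatten_function(record)
```

Driving both the flattening and the schema from one ordered list means the tuple positions and the schema fields cannot drift apart. Missing keys become None here, which assumes the corresponding fields in flat_schema are declared nullable.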
