How to load specific Hive partition in DataFrame Spark 1.6?


Problem description


From Spark 1.6 onwards, as per the official doc, we cannot add specific Hive partitions to a DataFrame.

Up to Spark 1.5 the following used to work, and the DataFrame would have the entity column and the data, as shown below:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/entity=xyz")

However, this does not work in Spark 1.6.

If I give the base path like the following, it does not contain the entity column which I want in the DataFrame, as shown below:

DataFrame df = hiveContext.read().format("orc").load("path/to/table/")

How do I load a specific Hive partition into a DataFrame? What was the driver behind removing this feature?

I believe it was efficient. Is there another way to achieve that in Spark 1.6?

As per my understanding, Spark 1.6 loads all partitions, and if I filter for specific partitions afterwards it is not efficient; it hits memory and throws GC (Garbage Collection) errors, because thousands of partitions get loaded into memory rather than only the specific one, as sketched below.
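For illustration, here is a minimal sketch of that filter-after-load approach, reusing the question's own hiveContext and paths (Spark 1.6 Java API); the xyzOnly variable name is just for this example:

// Load the whole table: Spark discovers every partition under the base path.
DataFrame df = hiveContext.read().format("orc").load("path/to/table/");

// Filter down to a single partition afterwards; with thousands of partitions
// this is the step that becomes memory-heavy and GC-prone.
DataFrame xyzOnly = df.filter(df.col("entity").equalTo("xyz"));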

Please advise.

Recommended answer

To load a specific partition into a DataFrame using Spark 1.6, first set the basePath option and then give the path of the partition that needs to be loaded:

DataFrame df = hiveContext.read().format("orc")
               .option("basePath", "path/to/table/")
               .load("path/to/table/entity=xyz")

So the above code will load only the specific partition into the DataFrame.
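As a quick sanity check (a sketch, assuming the df loaded above): with basePath set, partition discovery should keep the entity partition column in the schema alongside the data columns.

df.printSchema();  // should list the entity partition column
df.show();         // rows from the entity=xyz partition only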
