Filter Partition Before Reading Hive Table (Spark)

Views: 1084 Published: 2018/6/6 11:18:23 Tags: apache-spark hive hdfs


Problem Description

Currently I'm trying to filter a Hive table by the latest date_processed. 

The table is partitioned by:

System
date_processed 
Region

The only way I've managed to filter it is with a join query:
query = "select * from contracts_table as a join (select max(date_processed) as maximum from contracts_table) as b on a.date_processed = b.maximum"
This approach is really time-consuming, as I have to repeat the same procedure for 25 tables.

Does anyone know a way to directly read only the most recently loaded partition of a table in Spark < 1.6?

This is the method I'm using to read the table:
public static DataFrame loadAndFilter(String query)
{
        // Run the given SQL through the shared HiveContext and return the result
        return SparkContextSingleton.getHiveContext().sql(query);
}
Many thanks!
Solution
A DataFrame listing all partitions of the table can be obtained with:
val partitionsDF = hiveContext.sql("show partitions TABLE_NAME")
The returned partition values can then be parsed to find the maximum date_processed.
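The parsing step could look roughly like the sketch below, written in Java to match the question. It assumes each row returned by show partitions is a single string of the form col=value/col=value (Hive's usual layout), and that date_processed values sort correctly as plain strings (for example yyyy-MM-dd); the date format is not given in the question, so that part is a guess. The class name LatestPartition, its helper methods, and the sample rows are hypothetical and only serve as illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LatestPartition {

    // Extracts the value of one partition column from a partition spec such as
    // "system=A/date_processed=2018-06-05/region=EU".
    public static Optional<String> partitionValue(String partitionSpec, String key) {
        for (String part : partitionSpec.split("/")) {
            int eq = part.indexOf('=');
            if (eq > 0 && part.substring(0, eq).equalsIgnoreCase(key)) {
                return Optional.of(part.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    // Returns the largest value of the given partition column across all specs.
    // Assumption: values compare correctly as strings (e.g. yyyy-MM-dd dates).
    public static Optional<String> latestValue(List<String> partitionSpecs, String key) {
        return partitionSpecs.stream()
                .map(spec -> partitionValue(spec, key))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .max(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for the "show partitions" output.
        // With Spark, they would be collected from the DataFrame instead,
        // e.g. by calling row.getString(0) on each collected Row.
        List<String> rows = Arrays.asList(
                "system=A/date_processed=2018-06-01/region=EU",
                "system=A/date_processed=2018-06-05/region=US",
                "system=B/date_processed=2018-05-30/region=EU");

        String latest = latestValue(rows, "date_processed")
                .orElseThrow(() -> new IllegalStateException("no partitions found"));

        // Filtering on the partition column replaces the self-join from the question.
        String query = "select * from contracts_table where date_processed = '" + latest + "'";
        System.out.println(query);
    }
}
```

The resulting query string can be passed to the loadAndFilter method from the question. Because the filter is on a partition column with a literal value, the engine should be able to read only the matching partitions rather than scanning the whole table to compute the maximum.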
