Hive configuration for Spark integration tests


Question

I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before HiveContext is created.

Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1.

import org.apache.spark.sql.hive.HiveContext

val sqlc = new HiveContext(sparkContext)
// Setting this after the HiveContext is constructed updates the metastore config,
// but not the location used when writing table files.
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)

The table metadata goes in the right place but the written files go to /user/hive/warehouse.

If a dataframe is saved without an explicit path, e.g.,

df.write.saveAsTable("tbl")

the location of the written files is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location appears to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext has been created has no effect.

As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files, which wreaks havoc with expectations. This is a known problem--see here & here--and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.

In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?

Solution

In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder, before creating a SparkSession. It should propagate correctly.
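For example, a minimal sketch along those lines (the master, app name, and testWarehouseDir value are placeholders, not from the original post):

import org.apache.spark.sql.SparkSession

// Configure the warehouse location before the session (and its Hive state) exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-integration-test")
  .config("spark.sql.warehouse.dir", testWarehouseDir)  // e.g. a temp dir under the test root
  .enableHiveSupport()
  .getOrCreate()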

For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
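For instance, a rough sketch of generating one from a test, assuming the generated file lands somewhere the test classpath picks it up (the target/test-classes path below is illustrative) and that this runs before the HiveContext is constructed:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Point the warehouse at a throwaway directory for this test run.
val warehouseDir = Files.createTempDirectory("hive-warehouse").toUri.toString
val hiveSiteXml =
  s"""<configuration>
     |  <property>
     |    <name>hive.metastore.warehouse.dir</name>
     |    <value>$warehouseDir</value>
     |  </property>
     |</configuration>""".stripMargin
// Write hive-site.xml where the test runner will see it on the classpath.
Files.write(
  Paths.get("target/test-classes/hive-site.xml"),
  hiveSiteXml.getBytes(StandardCharsets.UTF_8))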
