Hive configuration for Spark integration tests
I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before the HiveContext is created.
Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1.
val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place but the written files go to /user/hive/warehouse.
If a dataframe is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database. That location appears to be cached during HiveContext construction, so setting fs.defaultFS after the HiveContext is constructed has no effect.
As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files, which wreaks havoc with expectations. This is a known problem--see here & here--and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
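Until that is fixed, a test suite can work around the leftover files by deleting the warehouse directory itself during teardown. A minimal sketch in plain Scala (no Spark dependency; the directory passed in is assumed to be your test-local warehouse path):

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

// Recursively delete a test warehouse directory so table files left
// behind by DROP TABLE cannot leak into the next test run.
def deleteRecursively(root: Path): Unit = {
  if (Files.exists(root)) {
    val stream = Files.walk(root)
    try {
      stream
        .sorted(Comparator.reverseOrder[Path]()) // children before parents
        .forEach(p => Files.delete(p))
    } finally stream.close()
  }
}
```

Calling this after each test (or suite) keeps the on-disk state consistent with the metastore even when DROP TABLE leaves files behind.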
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?
In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder, before creating a SparkSession. It should propagate correctly.
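A sketch of what that looks like, assuming Spark 2.x on the classpath (the master, app name, and warehouse path below are illustrative test values, not prescribed ones):

```scala
import org.apache.spark.sql.SparkSession

// Point the warehouse at a test-local directory *before* the
// SparkSession (and its underlying session state) is created.
val warehouseDir =
  java.nio.file.Files.createTempDirectory("spark-warehouse").toString

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("integration-test")
  .config("spark.sql.warehouse.dir", warehouseDir)
  .enableHiveSupport()
  .getOrCreate()
```

Because the config is set on the builder rather than on an already-constructed context, it is in place when the session state is initialized, avoiding the caching problem described in the question.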
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
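One way to do that, sketched below: write a minimal hive-site.xml into a directory that is already on the test classpath (e.g. target/test-classes) before the HiveContext is instantiated, so Hive picks it up at construction time. The property names are the real Hive/Hadoop keys; the paths you pass in are placeholders for your own test setup:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Generate a hive-site.xml that points both fs.defaultFS and the
// warehouse at test-local directories. Hive reads this file from the
// classpath when the HiveContext is constructed.
def writeHiveSiteXml(dir: Path, defaultFs: String, warehouseDir: String): Path = {
  val xml =
    s"""<?xml version="1.0"?>
       |<configuration>
       |  <property>
       |    <name>fs.defaultFS</name>
       |    <value>$defaultFs</value>
       |  </property>
       |  <property>
       |    <name>hive.metastore.warehouse.dir</name>
       |    <value>$warehouseDir</value>
       |  </property>
       |</configuration>
       |""".stripMargin
  val file = dir.resolve("hive-site.xml")
  Files.write(file, xml.getBytes(StandardCharsets.UTF_8))
  file
}
```

The file must exist before the HiveContext is created; generating it in a test-framework setup hook (e.g. beforeAll) is one way to guarantee the ordering.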