sparklyr: can I pass format and path options into spark_write_table? or use saveAsTable with spark_write_orc?
Question
Spark 2.0 with Hive
Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore.

In Spark I would do that like this:
irisDf.write.format("orc")
  .mode("overwrite")
  .option("path", "s3://my_bucket/iris/")
  .saveAsTable("my_database.iris")
In sparklyr I can use the spark_write_table function:
data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")

output <- spark_write_table(
  iris_spark,
  name = "my_database.iris",
  mode = "overwrite"
)
But this does not allow me to set path or format.
I can also use spark_write_orc:
spark_write_orc(
  iris_spark,
  path = "s3://my_bucket/iris/",
  mode = "overwrite"
)
but there is no saveAsTable option.
Now, I CAN use invoke statements to replicate the Spark code:
sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")

writer %>%
  invoke("format", "orc") %>%
  invoke("mode", "overwrite") %>%
  invoke("option", "path", "s3://my_bucket/iris/") %>%
  invoke("saveAsTable", "my_database.iris")
But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?
Answer
path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:
spark_write_table(
  iris_spark, name = "my_database.iris", mode = "overwrite",
  options = list(path = "s3a://my_bucket/iris/")
)
By default in Spark, this will create a table stored as Parquet at path (partition subdirectories can be specified with the partition_by argument).
As of today there is no such option for format, but an easy workaround is to set the spark.sessionState.conf.defaultDataSourceName property, either at runtime:
spark_session_config(
sc, "spark.sessionState.conf.defaultDataSourceName", "orc"
)
or when the session is created.
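The session-creation route can be sketched like this (a minimal sketch; the local master and the hardcoded "orc" value are assumptions for illustration):

```r
library(sparklyr)

# Build a config and set the default data source format
# before the Spark session is created
conf <- spark_config()
conf$spark.sessionState.conf.defaultDataSourceName <- "orc"

# Connect with that config; tables created without an explicit
# format will then default to ORC
sc <- spark_connect(master = "local", config = conf)
```

Setting the property at connection time avoids having to remember to call spark_session_config in every script that writes tables.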