sparklyr: can I pass format and path options into spark_write_table? or use saveAsTable with spark_write_orc?


Question

Spark 2.0 with Hive

Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore.

In Spark I would do that like this,

irisDf.write.format("orc")
    .mode("overwrite")
    .option("path", "s3://my_bucket/iris/")
    .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function,

data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
   iris_spark
  ,name = 'my_database.iris'
  ,mode = 'overwrite'
)

But this does not allow me to set the path or format options.

I can also use spark_write_orc,

spark_write_orc(
    iris_spark
  , path = "s3://my_bucket/iris/"
  , mode = "overwrite"
)

but there is no saveAsTable option.

Now, I CAN use invoke statements to replicate the Spark code,

sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")
writer %>%
  invoke('format', 'orc') %>%
  invoke('mode', 'overwrite') %>%
  invoke('option', 'path', "s3://my_bucket/iris/") %>%
  invoke('saveAsTable', "my_database.iris")

But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?

Answer

path can be set using the options argument, which is equivalent to an options call on the native DataFrameWriter:

spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite', 
  options = list(path = "s3a://my_bucket/iris/")
)

By default in Spark, this will create a table stored as Parquet at path (partition subdirectories can be specified with the partition_by argument).
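Combining the two, a minimal sketch of a partitioned write (this assumes the connection sc and the iris_spark table from the question, and uses the Species column of iris purely as an illustrative partition key):

```r
library(sparklyr)

# Write the table to S3, partitioned by Species; partition_by creates
# one subdirectory per distinct value under the given path.
spark_write_table(
  iris_spark,
  name = "my_database.iris",
  mode = "overwrite",
  options = list(path = "s3a://my_bucket/iris/"),
  partition_by = "Species"
)
```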

As of today there is no such option for format, but an easy workaround is to set the spark.sessionState.conf.defaultDataSourceName property, either at runtime

spark_session_config(
  sc, "spark.sessionState.conf.defaultDataSourceName", "orc"
)

or when the session is created.
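For the session-creation route, a sketch of setting the same property through spark_config() before connecting (assuming a local Spark install for illustration):

```r
library(sparklyr)

# Build a config with ORC as the default data source, then connect;
# tables written via spark_write_table will now be stored as ORC.
conf <- spark_config()
conf[["spark.sessionState.conf.defaultDataSourceName"]] <- "orc"
sc <- spark_connect(master = "local", config = conf)
```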
