sparklyr: can I pass format and path options into spark_write_table? Or use saveAsTable with spark_write_orc?

Question

Spark 2.0 with Hive

Let's say I am trying to write a Spark DataFrame, irisDf, to ORC and save it to the Hive metastore.

In Spark I would do that like this,

irisDf.write.format("orc")
    .mode("overwrite")
    .option("path", "s3://my_bucket/iris/")
    .saveAsTable("my_database.iris")

In sparklyr I can use the spark_write_table function,

data("iris")
# copy the local iris data frame into Spark as a table named "iris"
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite'
)

But this doesn't allow me to set path or format.

I can also use spark_write_orc,

spark_write_orc(
  iris_spark,
  path = "s3://my_bucket/iris/",
  mode = "overwrite"
)

But it doesn't have a saveAsTable option.

Now, I CAN use invoke statements to replicate the Spark code,

sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")
writer %>%
  invoke('format', 'orc') %>%
  invoke('mode', 'overwrite') %>%
  invoke('option', 'path', "s3://my_bucket/iris/") %>%
  invoke('saveAsTable', "my_database.iris")

But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?

Answer

path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:

spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite', 
  options = list(path = "s3a://my_bucket/iris/")
)

By default in Spark, this will create a table stored as Parquet at the given path (partition subdirectories can be specified with the partition_by argument).
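
For illustration, here is a minimal sketch that reuses the iris_spark table from the question and assumes a sparklyr version whose spark_write_table exposes the partition_by argument mentioned above (the Species column and the bucket path are just placeholders):

# Same write as before, but partition the output by Species,
# which creates Species=... subdirectories under the given path.
spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite',
  options = list(path = "s3a://my_bucket/iris/"),
  partition_by = "Species"
)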

As of today there is no such option for format, but an easy workaround is to set the spark.sessionState.conf.defaultDataSourceName property, either at runtime

spark_session_config(
  sc, "spark.sessionState.conf.defaultDataSourceName", "orc"
)

or when you create the session.
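
A rough sketch of the session-creation variant, assuming the property is also honored when supplied through spark_config() (the master URL is a placeholder; adjust it for your cluster):

library(sparklyr)

# Put the default-data-source property into the connection config;
# whether Spark picks it up at launch depends on your Spark version.
config <- spark_config()
config$spark.sessionState.conf.defaultDataSourceName <- "orc"

# Create the session with that config
sc <- spark_connect(master = "local", config = config)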
