Transfer data from database to Spark using sparklyr


Question

I have some data in a database, and I want to work with it in Spark, using sparklyr.

I can use a DBI-based package to import the data from the database into R

dbconn <- dbConnect(<some connection args>)
data_in_r <- dbReadTable(dbconn, "a table") 

and then copy it from R into Spark using

sconn <- spark_connect(<some connection args>)
data_ptr <- copy_to(sconn, data_in_r)

Copying twice is slow for big datasets.

How can I copy data directly from the database into Spark?

sparklyr has several spark_read_*() functions for import, but nothing database-related. sdf_import() looks like a possibility, but it isn't clear how to use it in this context.

Answer

sparklyr >= 0.6.0

You can use spark_read_jdbc().
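
A minimal sketch (the URL, table name, and credentials are placeholders to adjust for your database; the JDBC driver must still be on Spark's classpath, as described below):

library(sparklyr)

sc <- spark_connect(<some connection args>)

# Read the table straight into Spark; nothing is materialised in R
data_ptr <- spark_read_jdbc(
  sc,
  name = "foo",  # name of the resulting Spark view
  options = list(
    url      = "jdbc:postgresql://host/database",  # placeholder URL
    dbtable  = "table",                            # placeholder table name
    user     = "scott",                            # placeholder credentials
    password = "tiger",
    driver   = "org.postgresql.Driver"
  )
)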

sparklyr < 0.6.0

I hope there is a more elegant solution out there, but here is a minimal example using the low-level invoke API:

  • Make sure that Spark has access to the required JDBC driver, for example by adding its coordinates to spark.jars.packages. For example, with PostgreSQL (adjust for the current version) you could add:

spark.jars.packages org.postgresql:postgresql:9.4.1212

to SPARK_HOME/conf/spark-defaults.conf
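
If editing spark-defaults.conf is inconvenient, the same property can be set from R instead (a sketch; it must be set before spark_connect(), since packages are resolved when the Spark JVM launches):

library(sparklyr)

config <- spark_config()
# Resolved at launch time; changing this after spark_connect() has no effect
config$spark.jars.packages <- "org.postgresql:postgresql:9.4.1212"
sc <- spark_connect(master = "local", config = config)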

  • Load the data and register it as a temporary view:

name <- "foo"

spark_session(sc) %>% 
  invoke("read") %>% 
  # JDBC URL and table name
  invoke("option", "url", "jdbc:postgresql://host/database") %>% 
  invoke("option", "dbtable", "table") %>% 
  # Add optional credentials
  invoke("option", "user", "scott") %>%
  invoke("option", "password", "tiger") %>% 
  # Driver class, here for PostgreSQL
  invoke("option", "driver", "org.postgresql.Driver") %>% 
  # Read and register as a temporary view
  invoke("format", "jdbc") %>% 
  invoke("load") %>% 
  # Spark 2.x, registerTempTable in 1.x
  invoke("createOrReplaceTempView", name)

You can pass multiple options at once using an environment:

invoke("options", as.environment(list(
  user="scott", password="tiger", url="jdbc:..."
)))
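
For example (a sketch), the chain above with the separate option calls collapsed into a single call; sparklyr passes the environment to Spark as a Java map:

spark_session(sc) %>%
  invoke("read") %>%
  invoke("format", "jdbc") %>%
  # All JDBC options in one call instead of repeated invoke("option", ...)
  invoke("options", as.environment(list(
    url      = "jdbc:postgresql://host/database",
    dbtable  = "table",
    user     = "scott",
    password = "tiger",
    driver   = "org.postgresql.Driver"
  ))) %>%
  invoke("load") %>%
  invoke("createOrReplaceTempView", name)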

  • Load the temporary view with dplyr:

    dplyr::tbl(sc, name)
    

  • Be sure to read further about JDBC options, focusing on partitionColumn, lowerBound / upperBound, and numPartitions; these enable parallel reads, as sketched after this list.

    For additional details see, for example, How to use JDBC source to write and read data in (Py)Spark? and How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
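
A sketch of such a parallel read using spark_read_jdbc() (it assumes the source table has a numeric column id; the bounds and partition count are made-up values to tune for your data):

spark_read_jdbc(
  sc,
  name = "foo_partitioned",
  options = list(
    url             = "jdbc:postgresql://host/database",
    dbtable         = "table",
    driver          = "org.postgresql.Driver",
    partitionColumn = "id",      # assumed numeric column to split on
    lowerBound      = "1",       # made-up range of that column
    upperBound      = "1000000",
    numPartitions   = "8"        # number of concurrent JDBC reads
  )
)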
