Inserting Analytic data from Spark to Postgres


Problem description

I have a Cassandra database from which I analyzed the data using SparkSQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Is there any way to achieve this directly, apart from using the PostgreSQL driver? I achieved it using postREST and the driver; I want to know whether there are any methods like saveToCassandra().

Recommended answer

At the moment there is no native implementation of writing the RDD to any DBMS. Here are the links to the related discussions on the Spark user list: one, two.

In general, the most performant approach would be the following:

  1. Validate the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine: if the number is lower, call repartition with 20 partitions; if it is higher, call coalesce down to 50 partitions.
  2. Call the mapPartitions transformation, and inside it call a function that inserts the records into your DBMS using JDBC. In this function you open a connection to your database and use the COPY command with this API, which eliminates the need for a separate command for each record; this way the insert is processed much faster (see the sketch after this list).
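
Below is a minimal Scala sketch of this approach. It uses foreachPartition (the action counterpart of the mapPartitions transformation mentioned above) together with the PostgreSQL JDBC driver's CopyManager, one API that exposes COPY from JDBC. The table name analytics_results, its (key, value) columns, and the (String, Double) record type are assumptions for illustration; adapt them to your own schema.

```scala
import java.io.StringReader
import java.sql.DriverManager

import org.apache.spark.rdd.RDD
import org.postgresql.PGConnection

// A minimal sketch, assuming the PostgreSQL JDBC driver is on the executor
// classpath. The table analytics_results, its (key, value) columns and the
// (String, Double) record type are hypothetical; adapt them to your schema.
def saveToPostgres(rdd: RDD[(String, Double)], connectionUrl: String): Unit = {
  // Step 1: keep the partition count (and thus the number of parallel
  // connections) in the 20-50 range.
  val partitioned =
    if (rdd.getNumPartitions < 20) rdd.repartition(20)
    else if (rdd.getNumPartitions > 50) rdd.coalesce(50)
    else rdd

  // Step 2: one connection and one COPY per partition, executed on the
  // executors.
  partitioned.foreachPartition { rows =>
    val conn = DriverManager.getConnection(connectionUrl)
    try {
      val copyApi = conn.unwrap(classOf[PGConnection]).getCopyAPI
      // Stream the whole partition through a single COPY command instead of
      // issuing one INSERT per record. Note this buffers the partition in
      // memory; for very large partitions you would stream instead.
      val data = rows.map { case (k, v) => s"$k\t$v\n" }.mkString
      copyApi.copyIn(
        "COPY analytics_results (key, value) FROM STDIN",
        new StringReader(data))
    } finally {
      conn.close()
    }
  }
}
```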

This way you would insert the data into Postgres in parallel, utilizing up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach could be implemented as a Java/Scala function that accepts the RDD and the connection string.
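
A hypothetical invocation from the driver program, under the same assumptions as the sketch above (the RDD and the connection string are placeholders):

```scala
import org.apache.spark.rdd.RDD

// analyzedData stands in for the output of your SparkSQL analysis.
val analyzedData: RDD[(String, Double)] = ???
saveToPostgres(
  analyzedData,
  "jdbc:postgresql://dbhost:5432/analytics?user=spark&password=secret")
```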

