Inserting Analytic data from Spark to Postgres


Question

I have a Cassandra database whose data I analyzed with SparkSQL through Apache Spark. Now I want to insert the analyzed data into PostgreSQL. Is there any way to do this directly, apart from using the PostgreSQL driver? (I already achieved it using PostgREST and the driver; I want to know whether there is a method like saveToCassandra().)

Recommended answer

At the moment there is no native implementation for writing an RDD to any DBMS. Here are the links to the related discussions on the Spark user list: one, two

In general, the most performant approach would be the following:


  1. Check the number of partitions in the RDD; it should be neither too low nor too high. 20-50 partitions should be fine. If the number is lower, call repartition with 20 partitions; if it is higher, call coalesce to 50 partitions.
  2. Call the mapPartitions transformation and, inside it, call a function that inserts the records into your DBMS using JDBC. In this function, open the connection to your database and use the COPY command with this API. This removes the need for a separate command per record, so the insert is processed much faster.
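Step 1 can be sketched as a small decision helper. This is only an illustration: the class and method names are hypothetical, and the 20 and 50 bounds are the ones suggested above, not values Spark prescribes. With Spark's Java API, the result would map to calling `repartition(20)` or `coalesce(50)` on the RDD.

```java
/** Hypothetical helper: keeps a partition count inside the 20-50 range suggested above. */
final class PartitionSizing {
    static final int MIN_PARTITIONS = 20; // lower bound taken from the answer
    static final int MAX_PARTITIONS = 50; // upper bound taken from the answer

    enum Action { REPARTITION, COALESCE, KEEP }

    /** Which Spark call the current partition count would call for. */
    static Action actionFor(int current) {
        if (current < MIN_PARTITIONS) return Action.REPARTITION; // too few: shuffle into more
        if (current > MAX_PARTITIONS) return Action.COALESCE;    // too many: merge into fewer
        return Action.KEEP;                                      // already in range
    }

    /** Partition count to end up with: the current count clamped into [20, 50]. */
    static int targetFor(int current) {
        return Math.max(MIN_PARTITIONS, Math.min(MAX_PARTITIONS, current));
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 32, 400}) {
            System.out.println(n + " partitions: " + actionFor(n) + ", target " + targetFor(n));
        }
    }
}
```

The two branches differ in cost: repartition always shuffles the data, while coalesce by default merges existing partitions without a full shuffle, which is why it is the usual choice for reducing the count.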

This way you would insert the data into Postgres in parallel, using up to 50 parallel connections (depending on your Spark cluster size and its configuration). The whole approach might be implemented as a Java/Scala function accepting the RDD and the connection string.
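The per-partition part of such a function might look like the sketch below. It covers only the piece that needs no Spark or JDBC dependency: turning one partition's rows into the tab-separated text format that COPY ... FROM STDIN reads by default. The class name, the method names, and the choice to represent a row as a `String[]` are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical sketch: streams one partition's rows in PostgreSQL COPY text format. */
final class CopyPartitionWriter {

    /** Escapes one field for COPY text format; a null field becomes \N. */
    static String escape(String field) {
        if (field == null) return "\\N";
        StringBuilder sb = new StringBuilder(field.length() + 8);
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash is the escape character
                case '\t': sb.append("\\t"); break;  // tab is the column delimiter
                case '\n': sb.append("\\n"); break;  // newline is the row delimiter
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes each row as one tab-separated line and returns the number of rows written. */
    static long writePartition(Iterator<String[]> rows, Writer sink) {
        long count = 0;
        try {
            while (rows.hasNext()) {
                String[] row = rows.next();
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) sink.write('\t');
                    sink.write(escape(row[i]));
                }
                sink.write('\n');
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // keeps the method usable inside lambdas
        }
        return count;
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        Iterator<String[]> rows = Arrays.asList(
                new String[] {"1", "alice"},
                new String[] {"2", null}).iterator();
        long written = writePartition(rows, out);
        System.out.print(out);
        System.out.println(written + " rows");
    }
}
```

In a real job, this would presumably run inside the RDD's `foreachPartition` (or `mapPartitions` followed by an action): open one JDBC connection per partition from the connection string, buffer or pipe the output of `writePartition` into a `Reader`, and pass it to the PostgreSQL JDBC driver's `CopyManager.copyIn("COPY target_table FROM STDIN", reader)`, reachable through `conn.unwrap(PGConnection.class).getCopyAPI()`. The table name is a placeholder and that wiring is untested here.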
