SPARK SQL - update MySql table using DataFrames and JDBC


Problem description

I'm trying to insert and update some data in MySQL using Spark SQL DataFrames and a JDBC connection.

I've succeeded in inserting new data using SaveMode.Append. Is there a way to update data that already exists in the MySQL table from Spark SQL?

The code I use to insert is:

myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)
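For context, the JDBC URL and connection properties referenced above could be defined roughly as follows; the host, database, table name, and credentials here are placeholders, not values from the original post:

import java.util.Properties

// Hypothetical connection setup -- replace host, database, and credentials with your own.
val JDBCurl = "jdbc:mysql://localhost:3306/mydb"
val mySqlTable = "my_table"
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "spark_user")
connectionProperties.setProperty("password", "secret")
connectionProperties.setProperty("driver", "com.mysql.jdbc.Driver")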

If I change to SaveMode.Overwrite it deletes the full table and creates a new one. I'm looking for something like the "ON DUPLICATE KEY UPDATE" available in MySQL.

Recommended answer

It is not possible. As of now (Spark 1.6.0 / 2.2.0-SNAPSHOT), Spark's DataFrameWriter supports only four write modes:

  • SaveMode.Overwrite: overwrite the existing data.
  • SaveMode.Append: append the data.
  • SaveMode.Ignore: ignore the operation (i.e. no-op).
  • SaveMode.ErrorIfExists: the default option; throws an exception at runtime.

You can insert manually, for example using mapPartitions (since the UPSERT operation you want is idempotent, and as such easy to implement), write to a temporary table and execute the upsert manually, or use triggers.
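As a minimal sketch of the first option, the code below uses foreachPartition (the side-effecting counterpart of mapPartitions) to run a batched INSERT ... ON DUPLICATE KEY UPDATE per partition. The table my_table with columns id and value is an assumed example schema, not something from the original question; adjust it to your own table.

import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, Row}

// Sketch only: assumes a MySQL table my_table(id INT PRIMARY KEY, value VARCHAR(255))
// and a DataFrame with matching columns "id" and "value".
def upsertPartitions(df: DataFrame, url: String, user: String, password: String): Unit = {
  df.foreachPartition { rows: Iterator[Row] =>
    // One connection and one transaction per partition.
    val conn = DriverManager.getConnection(url, user, password)
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement(
      "INSERT INTO my_table (id, value) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE value = VALUES(value)")
    try {
      rows.foreach { row =>
        stmt.setInt(1, row.getAs[Int]("id"))
        stmt.setString(2, row.getAs[String]("value"))
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.commit()
    } finally {
      stmt.close()
      conn.close()
    }
  }
}

Note that each partition commits independently, which is why the answer stresses that the operation should be idempotent.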

In general, achieving upsert behavior for batch operations while keeping decent performance is far from trivial. You have to remember that in the general case there will be multiple concurrent transactions in place (one per partition), so you have to ensure there are no write conflicts (typically by using application-specific partitioning) or provide appropriate recovery procedures. In practice, it may be better to perform batch writes to a temporary table and resolve the upsert part directly in the database.
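A rough sketch of that staging-table variant, reusing myDataFrame, JDBCurl, and connectionProperties from the question; the staging table name my_table_staging and the id/value columns are assumptions for illustration only:

import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// 1. Bulk-load the DataFrame into a disposable staging table.
myDataFrame.write
  .mode(SaveMode.Overwrite)
  .jdbc(JDBCurl, "my_table_staging", connectionProperties)

// 2. Resolve the upsert inside MySQL with a single set-based statement.
val conn = DriverManager.getConnection(
  JDBCurl,
  connectionProperties.getProperty("user"),
  connectionProperties.getProperty("password"))
try {
  conn.createStatement().executeUpdate(
    "INSERT INTO my_table (id, value) " +
    "SELECT id, value FROM my_table_staging " +
    "ON DUPLICATE KEY UPDATE value = VALUES(value)")
} finally {
  conn.close()
}

This keeps the conflict resolution in a single server-side statement, so Spark only has to perform a plain bulk write.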

