SPARK SQL - update MySql table using DataFrames and JDBC


Question

I'm trying to insert and update some data in MySql using Spark SQL DataFrames and a JDBC connection.

I've succeeded in inserting new data using SaveMode.Append. Is there a way to update data that already exists in the MySql table from Spark SQL?

The code I use to insert is:

myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)

If I change to SaveMode.Overwrite it deletes the whole table and creates a new one; I'm looking for something like the "ON DUPLICATE KEY UPDATE" available in MySql.

Recommended answer

It is not possible. As for now (Spark 1.6.0 / 2.2.0-SNAPSHOT), Spark's DataFrameWriter supports only four write modes:

  • SaveMode.Overwrite: overwrite the existing data.
  • SaveMode.Append: append the data.
  • SaveMode.Ignore: ignore the operation (i.e. no-op).
  • SaveMode.ErrorIfExists: the default option; throws an exception at runtime.
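
For reference, selecting one of these modes is just a matter of setting it on the writer. A minimal sketch using SaveMode.Ignore, reusing the JDBCurl, mySqlTable and connectionProperties names from the question:

import org.apache.spark.sql.SaveMode

// SaveMode.Ignore: create and fill the table only if it does not exist yet;
// otherwise the write is silently skipped.
myDataFrame.write
  .mode(SaveMode.Ignore)
  .jdbc(JDBCurl, mySqlTable, connectionProperties)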

You can insert manually, for example using mapPartitions (since the UPSERT operation you want should be idempotent and, as such, easy to implement), write to a temporary table and execute the upsert manually, or use triggers.
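
As a rough illustration of the first option, here is a sketch using foreachPartition (the side-effecting counterpart of mapPartitions) together with MySql's INSERT ... ON DUPLICATE KEY UPDATE. The table and column names (my_table, id, value) are assumptions made for the example, not part of the question:

import java.sql.DriverManager
import org.apache.spark.sql.Row

myDataFrame.foreachPartition { rows: Iterator[Row] =>
  // One JDBC connection per partition; each partition runs as its own transaction.
  val conn = DriverManager.getConnection(JDBCurl, connectionProperties)
  val stmt = conn.prepareStatement(
    "INSERT INTO my_table (id, value) VALUES (?, ?) " +
    "ON DUPLICATE KEY UPDATE value = VALUES(value)")
  try {
    rows.foreach { row =>
      stmt.setInt(1, row.getAs[Int]("id"))
      stmt.setString(2, row.getAs[String]("value"))
      stmt.addBatch()
    }
    stmt.executeBatch() // idempotent: re-running a partition rewrites the same values
  } finally {
    stmt.close()
    conn.close()
  }
}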

In general, achieving upsert behavior for batch operations while keeping decent performance is far from trivial. You have to remember that in the general case there will be multiple concurrent transactions in place (one per partition), so you have to ensure there will be no write conflicts (typically by using application-specific partitioning) or provide appropriate recovery procedures. In practice it may be better to perform batch writes to a temporary table and resolve the upsert part directly in the database.
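
A sketch of that temporary-table variant, under the same hypothetical table and column names (my_table_staging is an assumed staging table): Spark only does the bulk write, and a single server-side statement resolves the upsert, so no concurrent Spark-side transactions touch the target table:

import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// 1. Bulk-load the DataFrame into a staging table (recreated on every run).
myDataFrame.write
  .mode(SaveMode.Overwrite)
  .jdbc(JDBCurl, "my_table_staging", connectionProperties)

// 2. Merge the staging table into the target table in one statement inside MySql.
val conn = DriverManager.getConnection(JDBCurl, connectionProperties)
try {
  conn.createStatement().executeUpdate(
    "INSERT INTO my_table (id, value) " +
    "SELECT id, value FROM my_table_staging " +
    "ON DUPLICATE KEY UPDATE value = VALUES(value)")
} finally {
  conn.close()
}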
