SPARK SQL - update MySql table using DataFrames and JDBC
Question
I'm trying to insert and update some data on MySql using Spark SQL DataFrames and a JDBC connection.
I've succeeded in inserting new data using SaveMode.Append. Is there a way to update data that already exists in the MySql table from Spark SQL?
The code I use to insert is:
myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl, mySqlTable, connectionProperties)
If I change to SaveMode.Overwrite it deletes the full table and creates a new one. I'm looking for something like the "ON DUPLICATE KEY UPDATE" available in MySql.
Answer
It is not possible. As of now (Spark 1.6.0 / 2.2.0 SNAPSHOT) the Spark DataFrameWriter supports only four writing modes:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: the default option; throw an exception at runtime.
You can insert manually, for example using mapPartitions (since an UPSERT operation should be idempotent, it is easy to implement this way), write to a temporary table and execute the upsert manually, or use triggers.
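The per-partition approach above can be sketched with plain JDBC. This is a minimal sketch, not a production implementation: the helper names (buildUpsertSql, upsertPartition), the row representation as a Seq of column values, and the connection handling are all assumptions; in practice you would also handle credentials, retries, and batch sizing.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

// Build an INSERT ... ON DUPLICATE KEY UPDATE statement for the given
// table and column names (names here are purely illustrative).
def buildUpsertSql(table: String, columns: Seq[String]): String = {
  val placeholders = columns.map(_ => "?").mkString(", ")
  val updates = columns.map(c => s"$c = VALUES($c)").mkString(", ")
  s"INSERT INTO $table (${columns.mkString(", ")}) VALUES ($placeholders) " +
    s"ON DUPLICATE KEY UPDATE $updates"
}

// Upsert one partition's rows over a single JDBC connection, batching
// the statements so each partition costs one round trip per batch.
def upsertPartition(rows: Iterator[Seq[Any]],
                    url: String,
                    table: String,
                    columns: Seq[String]): Unit = {
  val conn: Connection = DriverManager.getConnection(url)
  try {
    val stmt: PreparedStatement = conn.prepareStatement(buildUpsertSql(table, columns))
    rows.foreach { row =>
      row.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v) }
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}

// From Spark you would then call it per partition, roughly:
// myDataFrame.rdd.foreachPartition { it =>
//   upsertPartition(it.map(_.toSeq), JDBCurl, mySqlTable, columnNames)
// }
```

Because each statement is an idempotent upsert, re-running a failed partition does not corrupt the table, which is why this pattern tolerates Spark's task retries.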
In general, achieving upsert behavior for batch operations while keeping decent performance is far from trivial. You have to remember that in the general case there will be multiple concurrent transactions in place (one per partition), so you have to ensure that there will be no write conflicts (typically by using application-specific partitioning) or provide appropriate recovery procedures. In practice it may be better to perform batch writes to a temporary table and resolve the upsert part directly in the database.
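The temporary-table variant can be sketched as follows. The staging table name, the buildMergeSql helper, and the two-step orchestration are assumptions for illustration; the Spark write itself is shown as a comment since it requires a live SparkSession and MySQL connection.

```scala
import java.sql.DriverManager

// Build the server-side merge statement that copies everything from the
// staging table into the target, updating rows whose key already exists.
// Table and column names are placeholders.
def buildMergeSql(target: String, staging: String, columns: Seq[String]): String = {
  val cols = columns.mkString(", ")
  val updates = columns.map(c => s"$c = VALUES($c)").mkString(", ")
  s"INSERT INTO $target ($cols) SELECT $cols FROM $staging " +
    s"ON DUPLICATE KEY UPDATE $updates"
}

// Step 1 (Spark side): bulk-load the DataFrame into a staging table.
// myDataFrame.write.mode(SaveMode.Overwrite)
//   .jdbc(JDBCurl, "mySqlTable_staging", connectionProperties)

// Step 2: resolve the upsert entirely inside MySQL with one statement,
// so the database handles all conflict resolution in a single transaction.
def mergeStagingIntoTarget(url: String, target: String,
                           staging: String, columns: Seq[String]): Unit = {
  val conn = DriverManager.getConnection(url)
  try {
    conn.createStatement().executeUpdate(buildMergeSql(target, staging, columns))
  } finally {
    conn.close()
  }
}
```

This sidesteps the concurrent-transaction problem mentioned above: the parallel Spark writers only touch the staging table, and the single merge statement is the only writer to the target.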