How to overwrite data with PySpark's JDBC without losing schema?


Problem description

I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:

df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)

The table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So I'd like to either overwrite only the data, keeping the table schema, or add the primary key constraint and indexes afterward. Can either one be done with PySpark? Or do I need to connect to PostgreSQL and execute the commands to add the indexes myself?

Recommended answer

The default behavior of mode="overwrite" is to first drop the table and then recreate it with the new data. You can instead truncate the existing table by including option("truncate", "true") and then write your own data:

df.write.option("truncate", "true").jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)
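
For reference, a minimal, self-contained sketch of the same call is shown below. The connection details (a local jdbc:postgresql://localhost:5432/mydb database, the postgres user, and a my_table table with id and value columns) are placeholder assumptions, and the PostgreSQL JDBC driver jar is assumed to be on Spark's classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-truncate-overwrite").getOrCreate()

# Placeholder connection details -- replace with your own.
DATABASE_URL = "jdbc:postgresql://localhost:5432/mydb"
DATABASE_TABLE = "my_table"
DATABASE_PROPERTIES = {
    "user": "postgres",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# "truncate" makes Spark issue TRUNCATE TABLE instead of DROP/CREATE,
# so the existing primary key and indexes survive the overwrite.
(df.write
    .option("truncate", "true")
    .jdbc(url=DATABASE_URL, table=DATABASE_TABLE,
          mode="overwrite", properties=DATABASE_PROPERTIES))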

This way, you are not recreating the table so it shouldn't make any modifications to your schema.
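
If you do let Spark drop and recreate the table, or simply want to add the constraints back afterward, another option is to run the DDL yourself from the driver once the write finishes. A minimal sketch using psycopg2, with hypothetical connection details and table/column names (my_table, id, value):

import psycopg2

# Hypothetical connection details and names -- adjust to your setup.
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="postgres", password="secret")
with conn, conn.cursor() as cur:
    # Re-add the primary key and a secondary index after the table was recreated.
    cur.execute("ALTER TABLE my_table ADD PRIMARY KEY (id);")
    cur.execute("CREATE INDEX IF NOT EXISTS my_table_value_idx ON my_table (value);")
conn.close()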
