Delta Lake rollback


Problem Description

I need an elegant way to roll back Delta Lake to a previous version.

My current approach is listed below:

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, testFolder)

// Time-travel to version 0, then rewrite the current table with that snapshot.
spark.read.format("delta")
  .option("versionAsOf", 0)
  .load(testFolder)
  .write
  .mode("overwrite")
  .format("delta")
  .save(testFolder)

This is ugly, though, as the whole data set needs to be rewritten. It seems that a metadata-only update should be sufficient, with no data I/O necessary. Does anyone know a better approach?

Recommended Answer

Here is a brutal solution. It is not ideal, but given that overwriting a large data set with partitions could be expensive, this easy solution could be helpful.

If you are not very sensitive to updates made after the desired rollback point, simply remove all version files in _delta_log that are later than the rollback time. The data files they referenced become unreferenced and can be released later by running vacuum. A sketch of this idea follows.
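This is a minimal sketch of that log-truncation idea, assuming a Scala/Spark session with Hadoop FileSystem access; tablePath and targetVersion are placeholder values, and this is untested surgery on the transaction log, so try it on a copy of the table first:

import org.apache.hadoop.fs.Path

val tablePath = "/data/events"   // assumed table location
val targetVersion = 3L           // last version to keep

val logDir = new Path(tablePath, "_delta_log")
val fs = logDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Commit files are named <20-digit zero-padded version>.json; single-file
// checkpoints are <version>.checkpoint.parquet. Delete both kinds past the
// target version (multi-part checkpoint names are not handled here).
val versioned = """(\d{20})\.(json|checkpoint\.parquet)""".r
fs.listStatus(logDir).map(_.getPath).foreach { p =>
  p.getName match {
    case versioned(v, _) if v.toLong > targetVersion => fs.delete(p, false)
    case _ => // keep everything else
  }
}

// _last_checkpoint may now point at a deleted checkpoint; removing it makes
// Delta rediscover the latest state from the remaining log files.
val lastCheckpoint = new Path(logDir, "_last_checkpoint")
if (fs.exists(lastCheckpoint)) fs.delete(lastCheckpoint, false)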

Another solution, which preserves the full history, is to 1) call deltaTable.delete, then 2) copy each commit log up to the rollback point, sequentially and with new increasing version numbers, to the end of the log after the delete commit. This replays the creation of the Delta table up to the rollback date. But it is surely not pretty.
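This is a minimal sketch of that replay idea, again with assumed paths and versions rather than a tested recipe; it relies on the delete being only logical, so the old data files still exist on storage and the re-appended add actions can reference them:

import io.delta.tables._
import org.apache.hadoop.fs.{FileUtil, Path}

val tablePath = "/data/events"   // assumed table location
val rollbackVersion = 3L         // version to roll back to

// 1) Logically delete all rows; this writes one new commit at the head.
DeltaTable.forPath(spark, tablePath).delete()

val logDir = new Path(tablePath, "_delta_log")
val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = logDir.getFileSystem(hadoopConf)

def logFile(v: Long) = new Path(logDir, f"$v%020d.json")

// The delete commit is now the highest version in the log.
val deleteVersion = fs.listStatus(logDir)
  .map(_.getPath.getName)
  .collect { case n if n.matches("""\d{20}\.json""") => n.stripSuffix(".json").toLong }
  .max

// 2) Re-append commits 0..rollbackVersion after the delete commit, so the
// old add-file actions become live again.
for (v <- 0L to rollbackVersion) {
  FileUtil.copy(fs, logFile(v), fs, logFile(deleteVersion + 1 + v),
    false /* deleteSource */, hadoopConf)
}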
