Can underlying parquet files be deleted without negatively impacting DeltaLake _delta_log
Question
Using .vacuum() on a DeltaLake table is very slow (see Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs).
If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Obviously time-traveling, i.e. loading a previous version of the table that relied on the parquet files I removed, would not work. What I want to know is, would there be any issues reading, writing, or appending to the current version of the DeltaLake table?
What I am thinking of doing in pySpark:
### Assuming a working SparkSession as `spark`
from subprocess import check_output
import json
from pyspark.sql import functions as F

# Read the latest checkpoint version from _last_checkpoint and zero-pad
# it to the 20 digits used in the log file names
awscmd = "aws s3 cp s3://my_s3_bucket/delta/_delta_log/_last_checkpoint -"
last_checkpoint = str(json.loads(check_output(awscmd, shell=True).decode("utf-8")).get('version')).zfill(20)

s3_bucket_path = "s3a://my_s3_bucket/delta"

# Collect the `remove` actions (tombstones) from the checkpoint whose
# deletion timestamp is older than the 7-day retention window
df_chkpt_del = (
    spark.read.format("parquet")
    .load(f"{s3_bucket_path}/_delta_log/{last_checkpoint}.checkpoint.parquet")
    .where(F.col("remove").isNotNull())
    .select("remove.*")
    .withColumn("deletionTimestamp", F.from_unixtime(F.col("deletionTimestamp") / 1000))
    .withColumn("delDateDiffDays", F.datediff(F.col("deletionTimestamp"), F.current_timestamp()))
    .where(F.col("delDateDiffDays") < -7)
)
There are a lot of options from here. One could be:
df_chkpt_del.select("path").toPandas().to_csv("files_to_delete.csv", index=False)
Where I could read files_to_delete.csv into a bash array and then use a simple bash for loop, passing each parquet file s3 path to an aws s3 rm command to remove the files one by one.
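A minimal Python sketch of that delete loop (assuming files_to_delete.csv has the single path header that the toPandas().to_csv() call above writes, and that the paths recorded in the checkpoint are relative to the table root — both assumptions worth verifying against your own log):

```python
import csv

def build_rm_commands(csv_path, bucket="my_s3_bucket"):
    """Read the 'path' column written by to_csv() above and build one
    'aws s3 rm' command string per parquet file. Paths from the Delta
    log are assumed relative to the table root (delta/)."""
    commands = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            commands.append(f"aws s3 rm s3://{bucket}/delta/{row['path']}")
    return commands
```

Each command could then be executed with subprocess.run(cmd.split()), or, for fewer round trips, the same paths could be passed in batches to boto3's delete_objects instead of shelling out.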
This may be slower than vacuum(), but at least it will not be consuming cluster resources while it is working.
If I do this, will I also have to either:
- write a new _delta_log/000000000000000#####.json file that correctly documents these changes?
- write a new 000000000000000#####.checkpoint.parquet file that correctly documents these changes and change the _delta_log/_last_checkpoint file to point to that checkpoint.parquet file?
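For the first option, each line of a Delta commit file is one JSON action, and removed files are recorded as remove actions. A minimal sketch of building such lines follows; the field names follow the open-source Delta transaction log protocol, but whether dataChange should be False here, and whether such a commit is even valid for files that were already tombstoned in an earlier version, would need to be checked against the protocol spec:

```python
import json
import time

def build_remove_commit(paths):
    """Build the lines of a hypothetical _delta_log commit file that
    records the removal of the given data files. Each line is one JSON
    action; the 'remove' action carries the file path, a deletion
    timestamp in milliseconds, and a dataChange flag (False assumed
    here, since no logical data changes)."""
    now_ms = int(time.time() * 1000)
    return [
        json.dumps({"remove": {"path": p,
                               "deletionTimestamp": now_ms,
                               "dataChange": False}})
        for p in paths
    ]
```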
The second option would be easier. However, if there will be no negative effects if I just remove the files and don't change anything in the _delta_log, then that would be the easiest.
Answer
TLDR. Answering this question:

If I manually deleted the underlying parquet files and did not add a new json log file or add a new .checkpoint.parquet file and change the _delta_log/_last_checkpoint file that points to it; what would the negative impacts to the DeltaLake table be, if any?
Yes, this could potentially corrupt your delta table.
Let me briefly explain how Delta Lake reads a version using the _delta_log.
If you want to read version x, it will go through the delta logs of all versions from 1 to x-1 and build a running sum of the parquet files to read. A summary of this process is saved as a .checkpoint after every 10th version to keep this running sum efficient.
Assume:

version 1 log says: add file_1, file_2, file_3
version 2 log says: delete file_1, file_2, and add file_4

So when reading version 2, the combined instruction will be:

add file_1, file_2, file_3 -> delete file_1, file_2, add file_4

So, the resultant files read will be file_3 and file_4.
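The running-sum replay described above can be sketched as a small fold over per-version actions (simplified here to plain sets of paths; the real log entries are full JSON add/remove actions):

```python
def replay(versions):
    """Apply each version's remove/add actions in order and return the
    set of parquet files a reader of the latest version must load."""
    live = set()
    for actions in versions:
        live -= set(actions.get("remove", []))
        live |= set(actions.get("add", []))
    return live

log = [
    {"add": ["file_1", "file_2", "file_3"]},              # version 1
    {"remove": ["file_1", "file_2"], "add": ["file_4"]},  # version 2
]
# replay(log) → {"file_3", "file_4"}
```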
Say in version 3, you delete file_4 from the file system. If you don't use .vacuum, then the delta log will not know that file_4 is not present; it will try to read it and will fail.