Spark - Read and Write back to same S3 location

Question

I am reading two datasets, dataset1 and dataset2, from S3 locations. I then transform them and write the result back to the same location that dataset2 was read from.

However, I get the following error message:

An error occurred while calling o118.save. No such file or directory 's3://<myPrefix>/part-00001-a123a120-7d11-581a-b9df-bc53076d57894-c000.snappy.parquet

If I write to a new S3 location instead, e.g. s3://dataset_new_path.../, the code works fine.

my_df \
  .write.mode('overwrite') \
  .format('parquet') \
  .save(s3_target_location)

Note: I have tried calling .cache() after reading in the dataframe, but I still get the same error.

Recommended answer

The problem is that the path you are reading from is the same path you are overwriting. This is a standard Spark issue and has nothing to do with AWS Glue.

Spark evaluates DataFrame transformations lazily: nothing runs until an action is called. Until then, Spark only builds a DAG that records every transformation to be applied to the DataFrame.

When you read from a location and then write to it in overwrite mode, the write is the action that triggers execution. For an overwrite, Spark's execution plan deletes the target path first and only then tries to read from that path, which is now empty; hence the error. Calling .cache() alone does not change this, because caching is also lazy and nothing is materialized until an action runs.
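The ordering problem can be reproduced without Spark. The following is only an analogy in plain Python, not Spark's actual implementation: the hypothetical helpers `lazy_read` and `overwrite` stand in for a lazy read and an overwrite-mode write, and a local temporary file stands in for the S3 path.

```python
import os
import tempfile


def lazy_read(path):
    """Return a deferred 'plan': nothing is read until the plan is called."""
    def run():
        with open(path) as f:
            return f.read().splitlines()
    return run


def overwrite(path, plan):
    """Mimic an overwrite: clear the target first, then evaluate the plan."""
    os.remove(path)   # the target is emptied before anything is computed
    rows = plan()     # only now is the source actually read
    with open(path, "w") as f:
        f.write("\n".join(rows))


# A stand-in for the dataset2 location (hypothetical local file, not S3).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "dataset2.txt")
with open(source, "w") as f:
    f.write("a\nb")

plan = lazy_read(source)  # like a lazy read: no I/O has happened yet

try:
    overwrite(source, plan)  # source and target are the same path
    outcome = "ok"
except FileNotFoundError:
    outcome = "No such file or directory"

print(outcome)
```

Because the read is deferred until after the delete, the plan has nothing left to read, which mirrors the error in the question.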

A possible workaround is to write to a temporary location first, then use that temporary location as the source and overwrite the dataset2 location from it.
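A minimal sketch of that workaround in PySpark is shown below. The function name `overwrite_via_temp` and the parameters `temp_location` and `target_location` are hypothetical, introduced here for illustration; `spark` is assumed to be an existing SparkSession and `df` the transformed DataFrame. It assumes the temporary location is a path that `df` does not read from.

```python
def overwrite_via_temp(spark, df, temp_location, target_location):
    """Overwrite target_location with df, even if df was read from it.

    temp_location must be a path that df does not read from (assumption).
    """
    # Step 1: materialize the result in the temporary location. The
    # original source is still intact here, so the lazy plan can read it.
    df.write.mode("overwrite").format("parquet").save(temp_location)

    # Step 2: re-read the staged copy. This DataFrame's plan depends only
    # on temp_location, not on the path that is about to be overwritten.
    staged = spark.read.format("parquet").load(temp_location)

    # Step 3: overwrite the real target from the staged copy.
    staged.write.mode("overwrite").format("parquet").save(target_location)
    return staged
```

With the names from the question, this might be called as `overwrite_via_temp(spark, my_df, s3_temp_location, s3_target_location)`, where `s3_temp_location` is a scratch path of your choosing. The sketch does not clean up the temporary location afterwards, and it writes the data twice, which costs extra time and storage.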
