How to avoid reading old files from S3 when appending new data?

Problem description

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is spent reading the old Parquet files, for example:

16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading

16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet' seeking to position '149195444'

Each of these operations appears to take less than 1 second per file, but the number of files grows over time (every append adds new files), which makes me think my code will not scale.

Any ideas how to avoid reading the old Parquet files from S3 if I only need to append new data?

I use EMR 4.8.2 and DirectParquetOutputCommitter:

sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')

Recommended answer

I resolved this issue by writing the DataFrame to EMR HDFS and then using s3-dist-cp to upload the Parquet files to S3.
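A minimal sketch of that approach (the HDFS staging path hdfs:///tmp/foo.parquet is a hypothetical choice; the bucket name and partition columns come from the question), assuming each run stages only the current batch:

# Stage the current batch on the cluster's HDFS instead of appending to S3 directly
df.write.mode("overwrite").partitionBy("id", "day").parquet("hdfs:///tmp/foo.parquet")

Then copy the staged files into the existing S3 dataset, for example as an EMR step or from the master node:

s3-dist-cp --src hdfs:///tmp/foo.parquet --dest s3://myBucket/foo.parquet

Because the Spark write no longer targets the S3 prefix, it does not need to open the old part files there; only the files produced by the current batch are uploaded.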
