How to avoid reading old files from S3 when appending new data?

Problem description

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is spent reading the old Parquet files, for example:

16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading

16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet' seeking to position '149195444'

Each of these operations appears to take less than 1 second per file, but the number of files grows over time (every append adds new files), which makes me think my code will not scale.

Any ideas how to avoid reading the old Parquet files from S3 if I only need to append new data?

I use EMR 4.8.2 and DirectParquetOutputCommitter:

sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')

Recommended answer

I resolved this issue by writing the DataFrame to EMR HDFS and then using s3-dist-cp to upload the Parquet files to S3.
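A minimal sketch of that approach (the HDFS staging path hdfs:///tmp/foo.parquet is a hypothetical choice; the bucket name and partition columns come from the question), assuming each run stages only the current batch:

# Stage the current batch on the cluster's HDFS instead of appending to S3 directly
df.write.mode("overwrite").partitionBy("id", "day").parquet("hdfs:///tmp/foo.parquet")

Then copy the staged files into the existing S3 dataset, for example as an EMR step or from the master node:

s3-dist-cp --src hdfs:///tmp/foo.parquet --dest s3://myBucket/foo.parquet

Because the Spark write no longer targets the S3 prefix, it does not need to open the old part files there; only the files produced by the current batch are uploaded.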
