s3 parquet write - too many partitions, slow writing


Problem Description

I have a Scala Spark job that writes to S3 as Parquet files. It holds 6 billion records so far and will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make query results faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so the write creates 1,330,360 Parquet files, and the write step is very slow: it has been running for the past 9 hours and still isn't finished.

output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")

Is there any way I can reduce the number of partitions and still keep the read query fast? Or is there a better way to handle this scenario? Thanks.

EDIT: id is an integer column with random numbers.

Recommended Answer

You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest anything specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
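Following that suggestion, here is a minimal Scala Spark sketch of both ideas. The bucket count (1024), the id_bucket column name, and the table/path names below are illustrative assumptions, not values from the question:

import org.apache.spark.sql.functions.{col, lit, pmod}

// Option 1: coarse partitioning on a derived bucket column.
// Since id is a random integer, id mod N spreads rows evenly across N buckets.
val numBuckets = 1024  // assumption: tune so each partition holds a manageable amount of data

val withBucket = output.withColumn("id_bucket", pmod(col("id"), lit(numBuckets)))

withBucket.write
  .mode("append")
  .partitionBy("id_bucket")   // 1024 directories instead of 1,330,360
  .parquet("s3a://datalake/db/")

// Readers filter on both columns so Spark prunes directories first,
// then uses Parquet row-group statistics to locate the exact id:
// spark.read.parquet("s3a://datalake/db/")
//   .where(col("id_bucket") === someId % numBuckets && col("id") === someId)

// Option 2: Hive-style bucketing (requires saveAsTable, not a raw path).
output.write
  .mode("append")
  .bucketBy(numBuckets, "id")
  .sortBy("id")
  .option("path", "s3a://datalake/db_bucketed/")
  .saveAsTable("datalake_db_bucketed")

Either way, the number of output directories is driven by the bucket count (here 1024) rather than the 1,330,360 distinct ids, while id-based reads can still prune down to a single bucket.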
