s3 parquet write - too many partitions, slow writing


Problem Description

I have a Scala Spark job that writes to S3 as Parquet files. It holds 6 billion records so far and will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make query results faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so the write creates 1,330,360 Parquet files, and the write step is very slow: it has been running for the past 9 hours and still isn't finished.

output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")

Is there any way I can reduce the number of partitions and still keep the read query fast? Or is there a better way to handle this scenario? Thanks.

EDIT: id is an integer column with random numbers.

Recommended Answer

You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest anything specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
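Following that suggestion, here is a minimal Scala Spark sketch of both ideas. The bucket count (1024), the id_bucket column name, and the table/path names below are illustrative assumptions, not values from the question:

import org.apache.spark.sql.functions.{col, lit, pmod}

// Option 1: coarse partitioning on a derived bucket column.
// Since id is a random integer, id mod N spreads rows evenly across N buckets.
val numBuckets = 1024  // assumption: tune so each partition holds a manageable amount of data

val withBucket = output.withColumn("id_bucket", pmod(col("id"), lit(numBuckets)))

withBucket.write
  .mode("append")
  .partitionBy("id_bucket")   // 1024 directories instead of 1,330,360
  .parquet("s3a://datalake/db/")

// Readers filter on both columns so Spark prunes directories first,
// then uses Parquet row-group statistics to locate the exact id:
// spark.read.parquet("s3a://datalake/db/")
//   .where(col("id_bucket") === someId % numBuckets && col("id") === someId)

// Option 2: Hive-style bucketing (requires saveAsTable, not a raw path).
output.write
  .mode("append")
  .bucketBy(numBuckets, "id")
  .sortBy("id")
  .option("path", "s3a://datalake/db_bucketed/")
  .saveAsTable("datalake_db_bucketed")

Either way, the number of output directories is driven by the bucket count (here 1024) rather than the 1,330,360 distinct ids, while id-based reads can still prune down to a single bucket.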
