How to combine small parquet files to one large parquet file?

Question

I have some partitioned Hive tables which point to parquet files. Each partition now contains a lot of small parquet files, each around 5 KB in size, and I want to merge those small files into one large file per partition. How can I achieve this to improve my Hive performance? I have tried reading all the parquet files in a partition into a PySpark dataframe, rewriting the combined dataframe to the same partition, and deleting the old files. But for some reason this seems inefficient, or beginner-level, to me. What are the pros and cons of doing it this way? And if there are any other ways, please guide me to achieve it in Spark or PySpark.
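For reference, a minimal sketch of the per-partition read-and-rewrite approach described above; the partition path, the compacted output path, and the cleanup step are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical partition directory of the Hive table
partition_path = '/warehouse/mydb.db/mytable/dt=2020-01-01'
compacted_path = partition_path + '_compacted'

# read all the small parquet files in the partition into one dataframe
df = spark.read.parquet(partition_path)

# coalesce to a single in-memory partition so only one file is written
df.coalesce(1).write.mode('overwrite').parquet(compacted_path)

# the old small files would then be removed, the compacted directory
# swapped into place, and the table metadata refreshed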

Answer

You can read the whole data, repartition it by the partition columns you have, and then write using partitionBy (this is also how you should save them in the future). Something like:

# Read everything, shuffle the rows by the partition keys, then write the
# data back out as a partitioned table, one directory per partition.
spark\
    .read\
    .parquet('...')\
    .repartition('key1', 'key2',...)\
    .write\
    .partitionBy('key1', 'key2',...)\
    .option('path', target_part)\
    .saveAsTable('partitioned')
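The point of repartitioning by the same columns that are passed to partitionBy is that all rows belonging to one Hive partition land in a single Spark partition, so the write produces roughly one parquet file per partition directory instead of one small file per task, which avoids recreating the small-file problem on future writes.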
