How to deal with a large number of parquet files


Problem description

I'm using Apache Parquet on Hadoop, and after a while I have a concern. When I generate parquet files from Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job produces a large number of parquet files, and when I try to query them the queries take a long time because Spark has to merge all the files together.

Can you show me the right way to deal with this, or am I perhaps misusing parquet? Have you already dealt with this, and how did you resolve it?

UPDATE 1: Would some "side job" that merges those files into one parquet file be good enough? What size of parquet file is preferred, and are there upper and lower bounds?

Recommended answer

Take a look at this GitHub repo and this answer. In short, keep the size of the files larger than the HDFS block size (128 MB or 256 MB).
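A minimal sketch of what that can look like in Spark (Scala), assuming a Spark 2.x `SparkSession` and hypothetical HDFS paths: repartition the DataFrame before writing so each output file lands near the HDFS block size, and run an occasional compaction "side job" that reads a directory of small files and rewrites it with a lower file count. The partition counts here are placeholders; tune them to your data volume.

```scala
// Sketch only: paths and partition counts are hypothetical.
import org.apache.spark.sql.SparkSession

object ParquetCompaction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-compaction")
      .getOrCreate()

    // 1) When producing data: control the number of output files up front.
    //    Fewer, larger files (around the HDFS block size) mean fewer tasks at read time.
    val df = spark.read.parquet("/data/events/raw")   // hypothetical input path
    df.repartition(8)                                 // pick N so each file is roughly 128-256 MB
      .write
      .mode("overwrite")
      .parquet("/data/events/compacted")              // hypothetical output path

    // 2) As a periodic side job: read an existing directory full of small
    //    parquet files and rewrite it with a smaller file count.
    val small = spark.read.parquet("/data/events/2016-01-01")
    small.coalesce(4)                                 // coalesce avoids a full shuffle
      .write
      .mode("overwrite")
      .parquet("/data/events/2016-01-01_compacted")

    spark.stop()
  }
}
```

`repartition` shuffles the data and gives evenly sized output files, while `coalesce` only merges existing partitions and is cheaper for a pure compaction pass; either way the goal is the same as in the answer above: file sizes at or above one HDFS block.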

