How to deal with a large number of parquet files


Problem description

I'm using Apache Parquet on Hadoop, and after a while I have a concern. When I generate parquet files from Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job produces a large number of parquet files, and when I try to query them the queries take a long time because Spark has to merge all the files together.

Can you show me the right way to deal with this, or am I perhaps misusing parquet? Have you already dealt with this, and how did you resolve it?

UPDATE 1: Would some "side job" that merges those files into one parquet file be good enough? What size of parquet file is preferred, and are there upper and lower bounds?

Recommended answer

Take a look at this GitHub repo and this answer. In short, keep the size of the files larger than the HDFS block size (128 MB or 256 MB).
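A minimal sketch of what that can look like in Spark (Scala), assuming a Spark 2.x `SparkSession` and hypothetical HDFS paths: repartition the DataFrame before writing so each output file lands near the HDFS block size, and run an occasional compaction "side job" that reads a directory of small files and rewrites it with a lower file count. The partition counts here are placeholders; tune them to your data volume.

```scala
// Sketch only: paths and partition counts are hypothetical.
import org.apache.spark.sql.SparkSession

object ParquetCompaction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-compaction")
      .getOrCreate()

    // 1) When producing data: control the number of output files up front.
    //    Fewer, larger files (around the HDFS block size) mean fewer tasks at read time.
    val df = spark.read.parquet("/data/events/raw")   // hypothetical input path
    df.repartition(8)                                 // pick N so each file is roughly 128-256 MB
      .write
      .mode("overwrite")
      .parquet("/data/events/compacted")              // hypothetical output path

    // 2) As a periodic side job: read an existing directory full of small
    //    parquet files and rewrite it with a smaller file count.
    val small = spark.read.parquet("/data/events/2016-01-01")
    small.coalesce(4)                                 // coalesce avoids a full shuffle
      .write
      .mode("overwrite")
      .parquet("/data/events/2016-01-01_compacted")

    spark.stop()
  }
}
```

`repartition` shuffles the data and gives evenly sized output files, while `coalesce` only merges existing partitions and is cheaper for a pure compaction pass; either way the goal is the same as in the answer above: file sizes at or above one HDFS block.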

