Effectively merge big parquet files

Question

I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs an amount of memory as large as the merged file. Are there other ways, or configurable options in parquet-tools, to use memory more effectively? I run the merge as a map job in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given.

Thanks.

Answer

I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so you will still have small row groups, just packed together in a single file. The resulting file will typically not have noticeably better performance, and under certain circumstances it may even perform worse than the separate files. See PARQUET-1115 for details.
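
You can check this effect yourself by inspecting the file's metadata. Here is a minimal sketch using pyarrow (not mentioned in the answer; the file path is a placeholder): if parquet-tools merge simply concatenated the row groups, the merged file will report as many small row groups as its inputs had in total.

```python
import pyarrow.parquet as pq

# Inspect the row-group layout of the merged file (hypothetical path).
meta = pq.ParquetFile("merged.parquet").metadata

print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    print(f"  group {i}: {meta.row_group(i).num_rows} rows")
```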

Currently, the only proper way to merge Parquet files is to read all the data from them and write it to a new Parquet file. You can do this with a MapReduce job (which requires writing custom code for this purpose) or using Spark, Hive or Impala.
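
For the Spark route, a minimal PySpark sketch of the read-everything-and-rewrite approach (the HDFS paths and the single-partition output are assumptions; for larger data, pick a partition count that yields reasonably sized output files instead of 1):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read every small Parquet file under the input directory (hypothetical path).
df = spark.read.parquet("hdfs:///data/small-files/")

# coalesce(1) funnels all data through a single writer task, so the output is
# one file with freshly built row groups rather than concatenated small ones.
df.coalesce(1).write.mode("overwrite").parquet("hdfs:///data/merged/")

spark.stop()
```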
