Effectively merge big parquet files

Question

I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs an amount of memory as large as the merged file. Are there other ways, or configurable options in parquet-tools, to use memory more effectively? I run the merge as a map job in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given.

Thanks.

Answer

I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so you will still have small row groups, just packed together in a single file. The resulting file will typically not have noticeably better performance, and under certain circumstances it may even perform worse than the separate files. See PARQUET-1115 for details.
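
You can check this effect yourself by inspecting the file's metadata. Here is a minimal sketch using pyarrow (not mentioned in the answer; the file path is a placeholder): if parquet-tools merge simply concatenated the row groups, the merged file will report as many small row groups as its inputs had in total.

```python
import pyarrow.parquet as pq

# Inspect the row-group layout of the merged file (hypothetical path).
meta = pq.ParquetFile("merged.parquet").metadata

print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    print(f"  group {i}: {meta.row_group(i).num_rows} rows")
```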

Currently, the only proper way to merge Parquet files is to read all the data from them and write it to a new Parquet file. You can do this with a MapReduce job (which requires writing custom code for this purpose) or using Spark, Hive or Impala.
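
For the Spark route, a minimal PySpark sketch of the read-everything-and-rewrite approach (the HDFS paths and the single-partition output are assumptions; for larger data, pick a partition count that yields reasonably sized output files instead of 1):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

# Read every small Parquet file under the input directory (hypothetical path).
df = spark.read.parquet("hdfs:///data/small-files/")

# coalesce(1) funnels all data through a single writer task, so the output is
# one file with freshly built row groups rather than concatenated small ones.
df.coalesce(1).write.mode("overwrite").parquet("hdfs:///data/merged/")

spark.stop()
```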
