Generating parquet files - differences between R and Python

Question

We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues:

  1. The output of Dask (i.e. fastparquet) has a _metadata and a _common_metadata file, while the parquet output from R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these parquet implementations?

Recommended answer

(Only answering 1; please post separate questions to make them easier to answer.)

_metadata and _common_metadata are helper files that are not required for a Parquet dataset. Spark/Dask/Hive/... use them to infer the metadata of all Parquet files in a dataset without having to read the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains the footers of all Parquet files. Only the first query on a dataset reads all files; later queries read only the file that caches all the footers.
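
To make "reading the footer" concrete: an unencrypted Parquet file ends with its serialized file metadata, followed by a 4-byte little-endian length and the magic bytes PAR1. The following is a minimal sketch, using only the Python standard library, of how a reader could locate that footer. The function name footer_span is hypothetical, and decoding the Thrift-encoded metadata itself is left out.

```python
import struct
from pathlib import Path

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def footer_span(path):
    """Return (offset, length) of the serialized file metadata in a Parquet file.

    Hypothetical helper for illustration: it only locates the footer and
    does not decode the Thrift-encoded metadata stored there.
    """
    size = Path(path).stat().st_size
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        raise ValueError(f"{path}: too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(size - 8)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError(f"{path}: missing trailing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (length,) = struct.unpack("<I", tail[:4])
    offset = size - 8 - length
    if offset < len(MAGIC):
        raise ValueError(f"{path}: footer length {length} exceeds file size")
    return offset, length
```

Without a summary file, an engine has to do this seek-and-read once per data file, which is the cost the helper files are meant to avoid.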

Tools using _metadata and _common_metadata should be able to leverage them for faster execution, but should not depend on them to operate. If the files do not exist, the query engine simply needs to read all footers.
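
The fallback described here can be sketched as follows. This is an illustration only, not how any particular engine is implemented: the function name metadata_sources is hypothetical, and it assumes that data files carry a .parquet extension, which is a common convention rather than a requirement.

```python
from pathlib import Path


def metadata_sources(dataset_dir):
    """Return the files a reader would open to learn a dataset's metadata.

    Hypothetical helper: prefers the optional _metadata summary file and
    otherwise falls back to every data file, whose footers must then be read.
    Assumes data files end in .parquet (a convention, not a requirement).
    """
    root = Path(dataset_dir)
    summary = root / "_metadata"
    if summary.is_file():
        # Fast path: one file describes the whole dataset.
        return [summary]
    # Slow path: no summary file, so every footer has to be read.
    return sorted(p for p in root.rglob("*.parquet") if p.is_file())
```

Because the pattern matches only names ending in .parquet, leftover .crc checksum files are ignored in either case.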
