Generating parquet files - differences between R and Python

Question
We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues:
- The output of Dask (i.e. fastparquet) has _metadata and _common_metadata files, while the parquet files from R/Drill do not have these files and have parquet.crc files instead (which can be deleted). What is the difference between these parquet implementations?
Answer

(Only answering 1); please post separate questions to make them easier to answer.)
_metadata and _common_metadata are helper files that are not required for a Parquet dataset. They are used by Spark/Dask/Hive/... to infer the metadata of all Parquet files in a dataset without having to read every file's footer. In contrast to this, Apache Drill generates (on demand) a similar file in each folder that contains all footers of all Parquet files. Only the first query on a dataset reads all files; further queries read only the file that caches all footers.
Tools that use _metadata and _common_metadata should be able to leverage them for faster execution times, but should not depend on them to operate. If the files do not exist, the query engine simply reads all footers instead.