How to write Parquet metadata with pyarrow?


Question

I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed.

Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but that seems like overkill, since my metadata is the same for all row groups in the file.

Is there any way to write file-wide Parquet metadata with pyarrow?

Recommended answer

This example shows how to use PyArrow to create a Parquet file that carries both file-level metadata and column-level metadata.

Suppose you have the following CSV data in movies.csv:

movie,release_year
three idiots,2009
her,2013

Read the CSV into a PyArrow table and define a custom schema. Metadata passed to pa.field is attached to that column, while metadata passed to pa.schema is attached to the file as a whole:

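import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Column types are inferred from the CSV (movie: string, release_year: int64)
table = pv.read_csv('movies.csv')

# Field-level metadata describes a column; schema-level metadata describes the file
my_schema = pa.schema([
    pa.field("movie", pa.string(), nullable=False, metadata={"spanish": "pelicula"}),
    pa.field("release_year", pa.int64(), nullable=True, metadata={"portuguese": "ano"})],
    metadata={"great_music": "reggaeton"})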

Cast the table to my_schema to get a new table that carries the metadata, and write it out as a Parquet file:

t2 = table.cast(my_schema)

pq.write_table(t2, 'movies.parquet')

Read the Parquet file back and fetch the file metadata. Keys and values are returned as bytes:

s = pq.read_table('movies.parquet').schema

s.metadata # => {b'great_music': b'reggaeton'}
s.metadata[b'great_music'] # => b'reggaeton'

Fetch the metadata associated with the release_year column from the same schema object:

s.field('release_year').metadata[b'portuguese'] # => b'ano'
