How to specify logical types when writing Parquet files from PyArrow?


Question

I'm using PyArrow to write Parquet files from some Pandas dataframes in Python.

Is there a way that I can specify the logical types that are written to the parquet file?

For example, writing an np.uint32 column in PyArrow results in an INT64 column in the Parquet file, whereas writing the same column with the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow).

For example:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp
import numpy as np

df = pd.DataFrame.from_records(data=[(1, 'foo'), (2, 'bar')], columns=['id', 'name'])
df['id'] = df['id'].astype(np.uint32)

# write parquet file using PyArrow
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet')

# write parquet file using fastparquet
fp.write('fastparquet.parquet', df)

# print schemas of both written files
print('PyArrow:', pq.ParquetFile('pyarrow.parquet').schema)
print('fastparquet:', pq.ParquetFile('fastparquet.parquet').schema)

This outputs:

PyArrow: <pyarrow._parquet.ParquetSchema object at 0x10ecf9048>
id: INT64
name: BYTE_ARRAY UTF8

fastparquet: <pyarrow._parquet.ParquetSchema object at 0x10f322848>
id: INT32 UINT_32
name: BYTE_ARRAY UTF8

I'm having similar issues with other column types, so I'm really looking for a generic way to specify the logical types that are used when writing with PyArrow.

Recommended answer

PyArrow writes Parquet format version 1.0 files by default, and version 2.0 is needed to use the UINT_32 logical type.

The solution is to specify the version when writing the table, i.e.

pq.write_table(pa.Table.from_pandas(df, preserve_index=False), 'pyarrow.parquet', version='2.0')

This then results in the expected parquet schema being written.
