Handling large timestamps when converting from pyarrow.Table to pandas

Views: 36 Posted: 2022/5/11 22:22:30 python pandas timestamp parquet pyarrow


Question

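I have a timestamp of 9999-12-31 23:59:59 stored in a parquet file as an int96. I read this parquet file using pyarrow.dataset and convert the resulting table into a pandas dataframe (using pyarrow.Table.to_pandas()). The conversion to a pandas dataframe turns my timestamp into 1816-03-30 05:56:07.066277376 (pandas timestamps probably have a smaller range of valid dates) without any complaint about the data type or anything else.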

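I then take this pandas dataframe, convert it back to a table, and write it to a parquet dataset using pyarrow.dataset.write_dataset.

The data I end up with is different from the data I started with, and I never saw a warning. (I found this out when I tried to create an Impala table from the parquet dataset and then could not query it properly.)

Is there a way to handle these large timestamps when converting from an Arrow table to pandas?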

I have tried passing the timestamp_as_object = True parameter to pyarrow.Table.to_pandas(), but it does not seem to do anything.

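Edit: here is a reproducible example. The problem is that, when reading the file, pyarrow thinks the timestamps are in nanoseconds even though they were stored as microseconds: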

import pyarrow as pa
import pyarrow.dataset as ds
non_legacy_hdfs_filesystem = None  # placeholder: connect to a filesystem here (None falls back to the local filesystem)
my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
parquet_format = ds.ParquetFileFormat()
write_options = parquet_format.make_write_options(use_deprecated_int96_timestamps = True, coerce_timestamps = 'us', allow_truncated_timestamps = True)
ds.write_dataset(data = my_table, base_dir = 'my_path', filesystem = non_legacy_hdfs_filesystem, format = parquet_format, file_options = write_options, partitioning= None)

dataset = ds.dataset('my_path', filesystem = non_legacy_hdfs_filesystem)
dataset.to_table().column('my_timestamps')

Recommended answer

My understanding is that your data was saved with use_deprecated_int96_timestamps=True.

import pyarrow as pa
import pyarrow.parquet as pq


my_table = pa.Table.from_arrays([pa.array(['9999-12-31', '9999-12-31', '9999-12-31']).cast('timestamp[us]')], names = ['my_timestamps'])
pq.write_table(my_table, '/tmp/table.pq',  use_deprecated_int96_timestamps=True)

In this mode, timestamps are saved as 96-bit integers with a (default/hard-coded) resolution of nanoseconds.

>>> pq.read_metadata('/tmp/table.pq').schema[0]
<ParquetColumnSchema>
  name: my_timestamps
  path: my_timestamps
  max_definition_level: 1
  max_repetition_level: 0
  physical_type: INT96
  logical_type: None
  converted_type (legacy): NONE
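The 1816 date reported in the question is consistent with the nanosecond count overflowing a signed 64-bit integer. The sketch below reproduces that arithmetic with the standard library only. It assumes plain two's-complement wraparound; wrap_int64 is a hypothetical helper written for this illustration, not a pyarrow function.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def wrap_int64(value):
    """Reduce an arbitrary Python int to the signed 64-bit range (two's complement)."""
    return (value + 2**63) % 2**64 - 2**63


original = datetime(9999, 12, 31, 23, 59, 59)

# Whole seconds since the Unix epoch, then scaled to nanoseconds.
seconds = (original - EPOCH) // timedelta(seconds=1)
nanos = seconds * 1_000_000_000

# The nanosecond count does not fit in 64 bits, so it wraps around.
wrapped = wrap_int64(nanos)

# Split the wrapped value back into whole seconds and leftover nanoseconds.
secs, frac_ns = divmod(wrapped, 1_000_000_000)
corrupted = EPOCH + timedelta(seconds=secs)

print(corrupted, frac_ns)
```

The wrapped value decodes to 1816-03-30 05:56:07 plus 66277376 nanoseconds, which matches the timestamp seen after the conversion.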

In recent versions of Arrow/Parquet, timestamps are 64-bit integers with a configurable resolution.

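It should be possible to convert legacy 96-bit nanosecond timestamps into 64-bit integers with microsecond resolution without losing information. Unfortunately, there is no option in the parquet reader that lets you do that (as far as I know).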

You may have to raise an issue with Parquet/Arrow, but I think they are working towards deprecating 96-bit integers.
