Pyarrow overwrites dataset when using S3 filesystem

Published: 2022/7/19 22:46:25. Tags: parquet, pyarrow

Problem description

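When I write two parquet files to a dataset locally, Arrow appends to the partitions correctly. For example, if I partition both files by column A, writing the first parquet file makes Arrow generate a directory structure with one subfolder per unique value in column A. When the second file is written, Arrow is smart enough to write the data into the matching partitions. So if files one and two share a value in column A, I see two separate files in the subfolder for that value. Code example: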

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# `base` is the local output directory prefix; it is defined outside this snippet.
df = pd.read_parquet('~/Desktop/rough/parquet_experiment/actual_07.parquet')
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(base + "parquet_dataset_partition_combined"),
                    partition_cols=['PartitionPoint'])

# Second write into the same dataset directory.
df = pd.read_parquet('~/Desktop/rough/parquet_experiment/actual_08.parquet')
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(base + "parquet_dataset_partition_combined"),
                    partition_cols=['PartitionPoint'])

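This results in:

Two folders are created because the partition column has a cardinality of 2 [A and B]. The subfolder PartitionPart=A contains two files because both actual_07 and actual_08 contribute rows to PartitionPart=A.

However, when I use exactly the same code with S3 as my filesystem, this does not happen. The code is as follows: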

from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-2")

df = pd.read_parquet('~/Desktop/rough/parquet_experiment/actual_07.parquet')
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, "parquet-storage",
                    partition_cols=['PartitionPoint'],
                    filesystem=s3)

df = pd.read_parquet('~/Desktop/rough/parquet_experiment/actual_08.parquet')
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, "parquet-storage",
                    partition_cols=['PartitionPoint'],
                    filesystem=s3)

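Instead, I find that the second write statement overwrites the data in S3. Each PartitionPart=A folder only ever contains one file at a time. Is this a caveat of using S3 as my filesystem?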

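Recommended answer

The change here is that you are implicitly switching from the legacy dataset writer to the new dataset writer. By default, pq.write_to_dataset uses the legacy behavior. However, if a filesystem is supplied (the legacy behavior does not support this), it uses the new behavior: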

    if use_legacy_dataset is None:
        # if a new filesystem is passed -> default to new implementation
        if isinstance(filesystem, FileSystem):
            use_legacy_dataset = False
        # otherwise the default is still True
        else:
            use_legacy_dataset = True

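The legacy writer's default behavior is to name files with a GUID, so if you perform two writes (each containing data for every folder) you get two files in each folder. The new writer's default behavior names files with a counter (e.g. part-{i}.extension). This means that multiple writes can overwrite existing files, because the counter resets on every call to write_to_dataset.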
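The difference between the two naming schemes can be illustrated without pyarrow at all. The following is a toy simulation using only the standard library, not pyarrow's actual writer code; the helper names `write_with_counter`, `write_with_guid`, and `demo` are hypothetical, and the text payloads stand in for real parquet data.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Iterable, Tuple


def write_with_counter(partition_dir: Path, payloads: Iterable[str]) -> None:
    """Name files part-0, part-1, ...; the counter restarts on every call."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    for i, payload in enumerate(payloads):
        # A later call reuses the same names and replaces earlier files.
        (partition_dir / f"part-{i}.parquet").write_text(payload)


def write_with_guid(partition_dir: Path, payloads: Iterable[str]) -> None:
    """Name each file with a fresh UUID, so names do not repeat across calls."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    for payload in payloads:
        (partition_dir / f"{uuid.uuid4().hex}.parquet").write_text(payload)


def demo() -> Tuple[int, int]:
    """Write twice with each scheme; return (counter file count, GUID file count)."""
    with tempfile.TemporaryDirectory() as tmp:
        counter_dir = Path(tmp) / "counter" / "PartitionPoint=A"
        guid_dir = Path(tmp) / "guid" / "PartitionPoint=A"
        for payload in ("rows from actual_07", "rows from actual_08"):
            write_with_counter(counter_dir, [payload])
            write_with_guid(guid_dir, [payload])
        return len(list(counter_dir.iterdir())), len(list(guid_dir.iterdir()))


if __name__ == "__main__":
    print(demo())  # (1, 2): the counter scheme keeps only the second write
```

With the counter scheme the second write lands on part-0.parquet again, which matches the single file per folder observed in the question.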

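To get the append behavior with the newer dataset writer, call pyarrow.dataset.write_dataset with the basename_template argument and generate a new basename_template for each write (an easy way is to include a UUID in the template). For example: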

import uuid

ds.write_dataset(table, '/tmp/mydataset', filesystem=s3,
                 partitioning=partitioning, basename_template=str(uuid.uuid4()) + '-{i}',
                 format='parquet')

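A few things to note when migrating to the new writer:

format='parquet' - The new writer supports writing several file formats, so you need to specify parquet.
partitioning=partitioning - The new writer has a more flexible way to specify the partitioning scheme. To get the old behavior you will want the hive flavor of partitioning: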

import pyarrow.dataset as ds
# Note, you have to supply a schema here and not just a list of columns.
# However, this is hopefully changing in part of 6.0 so you can take
# an approach similar to the old style of just specifying column
# names (ARROW-13755).
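# The field name must match the partition column used in the question,
# and flavor='hive' produces the column=value directory names.
partitioning = ds.partitioning(
    schema=pa.schema([pa.field('PartitionPoint', type=pa.string())]),
    flavor='hive')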
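Putting the pieces together, a minimal sketch of an append helper might look like the following. The names `unique_basename_template` and `append_to_dataset` are hypothetical, and the `.parquet` suffix in the template is a choice made here rather than something the answer prescribes. The sketch also assumes pyarrow 6.0 or later, where `write_dataset` accepts `existing_data_behavior` and by default refuses to write into a directory that already contains data; on older versions that argument would need to be dropped.

```python
import uuid


def unique_basename_template(prefix: str = "part", extension: str = "parquet") -> str:
    """Return a template that is unique per call and keeps the required {i} counter."""
    return f"{prefix}-{uuid.uuid4().hex}-{{i}}.{extension}"


def append_to_dataset(table, base_dir, partition_column, filesystem=None):
    """Write `table` into `base_dir` without clobbering files from earlier writes.

    pyarrow is imported lazily so the naming helper above works without it.
    """
    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hive flavor yields directories such as PartitionPoint=A.
    partitioning = ds.partitioning(
        pa.schema([table.schema.field(partition_column)]), flavor="hive"
    )
    ds.write_dataset(
        table,
        base_dir,
        format="parquet",
        partitioning=partitioning,
        basename_template=unique_basename_template(),
        filesystem=filesystem,
        # Assumption: pyarrow >= 6.0. Allows writing next to existing files.
        existing_data_behavior="overwrite_or_ignore",
    )
```

Calling `append_to_dataset(table, "parquet-storage", "PartitionPoint", filesystem=s3)` once per input file should then leave one file per write in each partition folder, since every call uses a different file name prefix.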
