How to use PyArrow to achieve a stream-writing effect


Question

My data arrives as a stream, and I want to store it in a single Parquet file. However, PyArrow overwrites the Parquet file every time I write. What should I do?

I tried keeping the writer open instead of closing it, but that does not seem to work: if I don't close the writer, I cannot read the file.

Here is the code:

import pyarrow as pa
import pyarrow.parquet as pq

# Schema shared by every one-row table
schema = pa.schema([pa.field('name', pa.string()), pa.field('age', pa.int64())])

for name in ['LEE', 'LSY', 'asd', 'wer']:
    # A new writer is created on every iteration, which overwrites the file
    writer = pq.ParquetWriter('d:/test.parquet', schema)
    arrays = [pa.array([name]), pa.array([2])]
    table = pa.Table.from_arrays(arrays, schema=schema)
    writer.write_table(table)
# Only the last writer is closed
writer.close()

What I actually want is to close the writer every time and then reopen it to append one row to the data, like this:

schema = pa.schema([pa.field('name', pa.string()), pa.field('age', pa.int64())])

for name in ['LEE', 'LSY', 'asd', 'wer']:
    # Reopening the same path overwrites the file instead of appending to it
    writer = pq.ParquetWriter('d:/test.parquet', schema)
    arrays = [pa.array([name]), pa.array([2])]
    table = pa.Table.from_arrays(arrays, schema=schema)
    writer.write_table(table)
    writer.close()

Recommended answer

A Parquet file cannot be appended to once it has been written. The typical solution is to write a new Parquet file each time (the files can together form a single partitioned Parquet dataset), or, if there is not much data, to first gather the data in Python into a single table and then write it once.

See this email thread for more discussion: https://lists.apache.org/thread.html/07b1e3f13b5dae7e34ee3752f3cd4d16a94deb3a5f43893b73475900@%3Cdev.arrow.apache.org%3E
