Pandas to Parquet NOT into the file system, but get the content of the resulting file in a variable

Problem description

There are several ways to convert a pandas DataFrame to Parquet, e.g. pyarrow.Table.from_pandas (combined with pyarrow.parquet.write_table) or DataFrame.to_parquet. What they have in common is that they take as a parameter the file path where df.parquet should be stored.

I need to get the content of the written Parquet file into a variable, and I have not yet seen a way to do this. Essentially, I want the same behavior as pandas' to_csv, which returns the result as a string if no path is provided.

Of course, I could just write the file and then read it back into a string with Python's standard file-reading operations. But since I am writing a ton of data, this would put a lot of load on the file system.

Recommended answer

You can either use io.BytesIO for this, or use pyarrow.BufferOutputStream, the native implementation that Apache Arrow provides. The benefit of the latter is that it writes to the stream without the overhead of going through Python, so fewer copies are made and the GIL is released.

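import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder data; substitute your own pandas.DataFrame.
df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf)
# buf now contains the Parquet file in memory.
# getvalue() returns a pyarrow.Buffer; to_pybytes() copies it into Python bytes.
parquet_bytes = buf.getvalue().to_pybytes()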
