How can I convert a PySpark dataframe to a CSV without sending it to a file?
Question
I have a dataframe which I need to convert to a CSV file, and then I need to send this CSV to an API. Since I'm sending it to an API, I do not want to save it to the local filesystem and need to keep it in memory. How can I do this?
Answer
Easy way: convert your dataframe to a Pandas dataframe with toPandas(), then save it to a string. To save to a string rather than a file, call to_csv with path_or_buf=None. Then send the string in the API call.
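A minimal sketch of that API call, assuming a hypothetical endpoint URL and using only the standard library's urllib (a real client might use requests and would add authentication):

```python
import urllib.request

# Hypothetical endpoint and payload -- substitute your real API URL, auth,
# and the string returned by to_csv(path_or_buf=None).
url = "https://api.example.com/upload"
csv_string = "id,name\n1,alice\n"

req = urllib.request.Request(
    url,
    data=csv_string.encode("utf-8"),
    headers={"Content-Type": "text/csv"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment against a real endpoint
```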
From the to_csv() documentation:
Parameters
path_or_buf : str or file handle, default None
File path or object; if None is provided the result is returned as a string.
So your code would likely look like this:
csv_string = df.toPandas().to_csv(path_or_buf=None)
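As a quick, runnable check of this behavior, here is a sketch using a plain Pandas DataFrame standing in for the result of df.toPandas():

```python
import pandas as pd

# Stand-in for df.toPandas(); in a real job this would come from a Spark DataFrame.
pdf = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

# With path_or_buf=None, to_csv returns the CSV text instead of writing a file.
csv_string = pdf.to_csv(path_or_buf=None, index=False)
```

Passing index=False drops the Pandas row index, which most APIs expecting a CSV will not want.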
Alternatives: use tempfile.SpooledTemporaryFile with a large buffer to create an in-memory file. Or you can even use a regular file: just make your buffer large enough and don't flush or close the file. Take a look at Corey Goldberg's explanation of why this works.
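A short sketch of the SpooledTemporaryFile alternative; the max_size value (an arbitrary 10 MB here) is the threshold below which the data never touches disk:

```python
import tempfile

# Below max_size bytes, the "file" lives entirely in memory;
# it only spills to an actual disk file past that size.
with tempfile.SpooledTemporaryFile(max_size=10 * 1024 * 1024, mode="w+") as buf:
    buf.write("id,name\n1,alice\n2,bob\n")  # e.g. CSV produced chunk by chunk
    buf.seek(0)
    payload = buf.read()  # read it back to hand off to an HTTP client
```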