Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?


Question



I have a multi-million record SQL table that I'm planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file.

However, I can't seem to find an API or parameter with the pyarrow library that allows me to specify something like:

file_scheme="hive"

As is supported by the fastparquet Python library.
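
For reference, the fastparquet behavior being described looks roughly like this (a sketch, not code from the question; 'outdir' is an illustrative directory name and df is a pandas DataFrame):

from fastparquet import write

# file_scheme='hive' makes fastparquet write a directory of part
# files plus metadata, rather than one single parquet file
write('outdir', df, file_scheme='hive')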

Here's my sample code:

#!/usr/bin/python

import pyodbc
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn_str = 'UID=username;PWD=passwordHere;' + \
    'DRIVER=FreeTDS;SERVERNAME=myConfig;DATABASE=myDB'

#----> Query the SQL database into a Pandas dataframe
conn = pyodbc.connect(conn_str, autocommit=False)
sql = "SELECT * FROM ClientAccount (NOLOCK)"
df = pd.io.sql.read_sql(sql, conn)


#----> Convert the dataframe to a pyarrow table and write it out
table = pa.Table.from_pandas(df)
pq.write_table(table, './clients/')

This throws an error:

File "/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py", line 912, in write_table
    os.remove(where)
OSError: [Errno 21] Is a directory: './clients/'

If I replace that last line with the following, it works fine but writes only one big file:

pq.write_table(table, './clients.parquet' )

Any ideas how I can do the multi-file output thing with pyarrow?

Solution

Try pyarrow.parquet.write_to_dataset (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L938).
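
A minimal sketch of that call, assuming the `table` from the question above ('state' in the commented variant is a hypothetical column name used only for illustration):

import pyarrow.parquet as pq

# each call writes a new uuid-named part file under ./clients/,
# so the output directory can hold many files
pq.write_to_dataset(table, root_path='./clients/')

# to additionally split output by column values, pass partition_cols;
# 'state' is a hypothetical column name:
# pq.write_to_dataset(table, root_path='./clients/', partition_cols=['state'])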

I opened https://issues.apache.org/jira/browse/ARROW-1858 about adding some more documentation about this.

I recommend seeking support for Apache Arrow on the mailing list dev@arrow.apache.org. Thanks!
