Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?


Question



I have a multi-million record SQL table that I'm planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file.

However, I can't seem to find an API or parameter with the pyarrow library that allows me to specify something like:

file_scheme="hive"

As is supported by the fastparquet Python library.
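
For reference, the fastparquet behavior being described looks roughly like this (a sketch, not code from the question; 'outdir' is an illustrative directory name and df is a pandas DataFrame):

from fastparquet import write

# file_scheme='hive' makes fastparquet write a directory of part
# files plus metadata, rather than one single parquet file
write('outdir', df, file_scheme='hive')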

Here's my sample code:

#!/usr/bin/python

import pyodbc
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn_str = 'UID=username;PWD=passwordHere;' + \
    'DRIVER=FreeTDS;SERVERNAME=myConfig;DATABASE=myDB'

#----> Query the SQL database into a Pandas dataframe
conn = pyodbc.connect(conn_str, autocommit=False)
sql = "SELECT * FROM ClientAccount (NOLOCK)"
df = pd.io.sql.read_sql(sql, conn)


#----> Convert the dataframe to a pyarrow table and write it out
table = pa.Table.from_pandas(df)
pq.write_table(table, './clients/')

This throws an error:

File "/usr/local/lib/python2.7/dist-packages/pyarrow/parquet.py", line 912, in write_table
    os.remove(where)
OSError: [Errno 21] Is a directory: './clients/'

If I replace that last line with the following, it works fine but writes only one big file:

pq.write_table(table, './clients.parquet' )

Any ideas how I can do the multi-file output thing with pyarrow?

Solution

Try pyarrow.parquet.write_to_dataset (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L938).
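
A minimal sketch of that call, assuming the `table` from the question above ('state' in the commented variant is a hypothetical column name used only for illustration):

import pyarrow.parquet as pq

# each call writes a new uuid-named part file under ./clients/,
# so the output directory can hold many files
pq.write_to_dataset(table, root_path='./clients/')

# to additionally split output by column values, pass partition_cols;
# 'state' is a hypothetical column name:
# pq.write_to_dataset(table, root_path='./clients/', partition_cols=['state'])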

I opened https://issues.apache.org/jira/browse/ARROW-1858 about adding some more documentation about this.

I recommend seeking support for Apache Arrow on the mailing list dev@arrow.apache.org. Thanks!
