Read Parquet files from HDFS using PyArrow

Views: 1949 | Published: 2018/6/6 11:12:59 | Tags: hdfs, parquet, pyarrow


Question


I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().

I also know I can read a parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a filepath, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get a hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, then I can pass the data along.

Answer

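Try

    import pyarrow as pa

    fs = pa.hdfs.connect(...)
    fs.read_parquet('/path/to/hdfs-file', **other_options)

or

    import pyarrow.parquet as pq

    # Open the file through the HDFS handle and pass the file object to read_table
    with fs.open(path) as f:
        table = pq.read_table(f, **read_options)
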
I opened https://issues.apache.org/jira/browse/ARROW-1848 to request more explicit documentation about this.
