Read a parquet file from HDFS using PyArrow

Question
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().

I also know I can read a parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, so I can then pass the data along.
Answer

Try

fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(...)
with fs.open(path) as f:
    pq.read_table(f, **read_options)
I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding some more explicit documentation about this.