Reading in a csv file as a dataframe from hdfs

Question

I'm using pydoop to read in a file from hdfs, and when I use:

import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()

It shows me the file in stdout.

Is there any way for me to read in this file as a dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error are:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist

Solution

I know next to nothing about hdfs, but I wonder if the following might work:

with hd.open("/home/file.csv") as f:
    df =  pd.read_csv(f)

I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
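To illustrate the file-handle idea without needing an HDFS cluster, here is a minimal sketch that passes an in-memory file-like object to read_csv. The io.StringIO buffer and the sample data are stand-ins I made up for the handle returned by hd.open; the assumption is that the pydoop handle exposes read() like a regular file object, which the print f.read() call in the question suggests.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of /home/file.csv.
csv_text = u"id,name\n1,alice\n2,bob\n"

# io.StringIO is only a stand-in for the handle returned by hd.open();
# read_csv accepts a path or a file-like object that has a read() method.
handle = io.StringIO(csv_text)

df = pd.read_csv(handle)
print(df)
```

If this works with the stand-in but not with the real hdfs handle, the difference is most likely in what the handle object supports rather than in read_csv itself.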

pd.read_csv("/home/file.csv") would work if the regular Python file open works - i.e. if it can read the file as a regular local file.

with open("/home/file.csv") as f: 
    print f.read()

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
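If passing the handle directly fails, one possible fallback is to read the whole file first (f.read() already works in the question) and hand read_csv an in-memory buffer. This is only a sketch: read_csv_via_buffer is a hypothetical helper name, not part of pandas or pydoop, and it loads the entire file into memory, so it is only reasonable for files that fit in RAM.

```python
import io

import pandas as pd


def read_csv_via_buffer(opener, path, **kwargs):
    """Read the whole file through `opener`, then parse it from memory.

    `opener` is any callable that returns a context-managed handle with
    a read() method, e.g. the built-in open or (assumed) hd.open.
    """
    with opener(path) as f:
        data = f.read()
    # Wrap the contents in the matching in-memory buffer type.
    if isinstance(data, bytes):
        buf = io.BytesIO(data)
    else:
        buf = io.StringIO(data)
    return pd.read_csv(buf, **kwargs)


# Assumed usage with pydoop, untested here:
#   import pydoop.hdfs as hd
#   df = read_csv_via_buffer(hd.open, "/home/file.csv")
```

Taking the opener as a parameter keeps the helper testable against a local file with the built-in open before trying it against hdfs.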
