Importing text file: No columns to parse from file


Question

I am trying to read input from sys.stdin. This is a MapReduce program for Hadoop. The input file is a plain-text (.txt) file. Preview of the data set:

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013
62  257 2   879372434
286 1014    5   879781125
200 222 5   876042340
210 40  3   891035994
224 29  3   888104457
303 785 3   879485318
122 387 5   879270459
194 274 2   879539794
291 1042    4   874834944

The code I have been trying:

import sys
import pandas as pd  # needed for pd.read_csv below

df = pd.read_csv(sys.stdin, error_bad_lines=False)

I have also tried delimiter='\t', header=False, and defining column names. Nothing seems to work; the error I get is:

[root@sandbox lab]# cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py
Traceback (most recent call last):
  File "/root/lab/mid-1-reducer.py", line 8, in <module>
    df = pd.read_csv(sys.stdin,delimiter='\t')
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 388, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
    self._make_engine(self.engine)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/rh/python27/root/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 538, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5896)
pandas.io.common.EmptyDataError: No columns to parse from file

However, when I try this directly in Python (not in Hadoop), it works fine.

I have looked through Stack Overflow posts; one of them suggested using try and except. Applying that leaves me with an empty file. Can anybody help? Thanks

Recommended answer

Using try and except only lets you catch errors and keep running in spite of them. It won't magically fix the cause of the error.

By default, read_csv expects comma-separated values, which your input clearly is not: its columns are separated by whitespace. A quick look at the documentation:

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

This seems like the right argument. Use

pandas.read_csv(filepath_or_buffer, delim_whitespace=True).
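As a rough illustration of the call above, here is a minimal sketch that parses a few rows mixing tabs and spaces. It uses sep=r"\s+", which the quoted documentation describes as equivalent to delim_whitespace=True (recent pandas releases deprecate delim_whitespace in favor of it). The io.StringIO object stands in for sys.stdin, and the column names are assumptions, since the file has no header row.

```python
import io
import pandas as pd

# Hypothetical sample standing in for sys.stdin; rows mix tabs and spaces.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186 302 3   891717742\n"
    "22\t377\t1\t878887116\n"
)

# Assumed column names; the input itself carries no header line.
columns = ["user_id", "item_id", "rating", "timestamp"]

# sep=r"\s+" splits on any run of whitespace (tabs or spaces).
# header=None tells pandas not to treat the first data row as a header.
df = pd.read_csv(sample, sep=r"\s+", header=None, names=columns)
print(df)
```

In the actual script, sys.stdin would be passed in place of sample.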

Using delimiter='\t' should also work, unless the tabs have been expanded (replaced by spaces). Since we can't tell from the post, delim_whitespace seems to be the safer option.

If this doesn't help, print out the contents of sys.stdin to check whether the text is actually being passed to your script.
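One way to do that check, sketched here for Python 3, is to read the whole stream once, write a short preview to stderr, and fail early if nothing arrived. The helper name read_input is hypothetical, and io.StringIO again stands in for sys.stdin.

```python
import io
import sys

def read_input(stream):
    """Return all text from stream after logging a short preview to stderr."""
    text = stream.read()
    # Writing to stderr keeps debug output out of the pipe to the next stage.
    sys.stderr.write("received %d characters\n" % len(text))
    sys.stderr.write("first lines: %r\n" % text.splitlines()[:3])
    if not text.strip():
        raise ValueError("input stream is empty; check the previous stage")
    return text

# Simulated input; in the real script this would be read_input(sys.stdin).
text = read_input(io.StringIO("196\t242\t3\t881250949\n"))
```

If the preview shows zero characters, the parser has nothing to read, and the problem lies upstream of read_csv rather than in its arguments.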

I just noticed that you run

cat /root/lab/u.data | python /root/lab/mid-1-mapper.py |python /root/lab/mid-1-reducer.py

Is this intended? This way, mid-1-reducer.py processes the output of mid-1-mapper.py, not the contents of u.data. If you want to process the content of the file u.data, consider reading the file directly instead of sys.stdin.
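If the pipeline is intended, the reducer only ever sees what the mapper prints, so a mapper that prints nothing would plausibly leave read_csv with empty input. The sketch below shows a mapper that re-emits tab-separated lines for the reducer to parse. The function name run_mapper, the field names, and the choice to emit item id and rating are all assumptions; the contents of mid-1-mapper.py are not shown in the post.

```python
import io

def run_mapper(infile, outfile):
    """Write one 'item_id<TAB>rating' line per well-formed input record."""
    for line in infile:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        user_id, item_id, rating, timestamp = fields  # assumed field meaning
        outfile.write("%s\t%s\n" % (item_id, rating))

# Simulated run; a real mapper script would call
# run_mapper(sys.stdin, sys.stdout) after importing sys.
out = io.StringIO()
run_mapper(io.StringIO("196\t242\t3\t881250949\n\n186 302 3   891717742\n"), out)
print(out.getvalue(), end="")
```

With output in this shape, the reducer could read its stdin with a tab delimiter and header=None, since the mapper emits no header line.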
