Pandas read csv without header (which might be there)


Problem description

I'm trying to read a .csv file in chunks (with the Python engine) while skipping the header, or any line that starts with a comment character. It is not known a priori whether the file has a header, so I cannot simply skip the first line, since it might already be a data row.

Setting header=None does not solve the problem: if I invoke get_chunk and ask for the row values, I still get the header or comment line.

The desired output would be the same as that of numpy.loadtxt().
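For reference, numpy.loadtxt() drops lines that begin with its comment character ('#' by default) before parsing. A minimal sketch on an in-memory file, with made-up sample values rather than the file from the question:

```python
import io

import numpy as np

# Illustrative sample: one commented header line followed by two data rows.
text = "# makes no sense\n0 1 2\n3 4 5\n"

# loadtxt skips the '#' line and parses the remaining rows as floats.
data = np.loadtxt(io.StringIO(text))
print(data.shape)
```

The commented line contributes neither a row nor a column, which is the behaviour wanted from the chunked pandas reader.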

The code below demonstrates what's going on:

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
print(np.loadtxt(fn).shape)  # output (100, 3)

reader = TextFileReader(fn, chunksize=10, header=None)
reader.get_chunk().values

# output
array([['#', 'makes', 'no', 'sense'],
       ['0.000000000000000000e+00', '1.000000000000000000e+00',
        '2.000000000000000000e+00', None],
       ['3.000000000000000000e+00', '4.000000000000000000e+00',
        '5.000000000000000000e+00', None],
       ['6.000000000000000000e+00', '7.000000000000000000e+00',
        '8.000000000000000000e+00', None],
       ['9.000000000000000000e+00', '1.000000000000000000e+01',
        '1.100000000000000000e+01', None],
       ['1.200000000000000000e+01', '1.300000000000000000e+01',
        '1.400000000000000000e+01', None],
       ['1.500000000000000000e+01', '1.600000000000000000e+01',
        '1.700000000000000000e+01', None],
       ['1.800000000000000000e+01', '1.900000000000000000e+01',
        '2.000000000000000000e+01', None],
       ['2.100000000000000000e+01', '2.200000000000000000e+01',
        '2.300000000000000000e+01', None],
       ['2.400000000000000000e+01', '2.500000000000000000e+01',
        '2.600000000000000000e+01', None]], dtype=object)

If I specify the comment character via

   reader = TextFileReader(fn, chunksize=10, header=None, comment='#')

I get an exception:

In [99]: reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-99-64b1c0bce4ef> in <module>()
----> 1 reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    560             self.options['has_index_names'] = kwds['has_index_names']
    561 
--> 562         self._make_engine(self.engine)
    563 
    564     def _get_options_with_defaults(self, engine):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    703             elif engine == 'python-fwf':
    704                 klass = FixedWidthFieldParser
--> 705             self._engine = klass(self.f, **self.options)
    706 
    707     def _failover_to_python(self):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1400         # Set self.data to something that can read lines.
   1401         if hasattr(f, 'readline'):
-> 1402             self._make_reader(f)
   1403         else:
   1404             self.data = f

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_reader(self, f)
   1505                 self.pos += 1
   1506                 self.line_pos += 1
-> 1507                 sniffed = csv.Sniffer().sniff(line)
   1508                 dia.delimiter = sniffed.delimiter
   1509                 if self.encoding is not None:

/home/marscher/anaconda/lib/python2.7/csv.pyc in sniff(self, sample, delimiters)
    180 
    181         quotechar, doublequote, delimiter, skipinitialspace = \
--> 182                    self._guess_quote_and_delimiter(sample, delimiters)
    183         if not delimiter:
    184             delimiter, skipinitialspace = self._guess_delimiter(sample,

/home/marscher/anaconda/lib/python2.7/csv.pyc in _guess_quote_and_delimiter(self, data, delimiters)
    221                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    222             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 223             matches = regexp.findall(data)
    224             if matches:
    225                 break

TypeError: expected string or buffer

Edit: this error is caused by not wrapping comment in a list.

Recommended answer

I know this is super old, and I never figured out what's going on with your comment error (your clarification of the problem didn't fix it for me, though I think it has something to do with calling a class rather than a function). Still, several modifications provide the output I think you're looking for.

First, if you tell the reader there is no header, it will interpret any header line as data, and that line then determines both the shape and the type of the data read in (e.g., numbers come back as strings). The reader can instead infer whether there is a header, so that the header is not read as a data row, which leaves comments as a separate issue.

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
np.loadtxt(fn).shape # output (100,3)

reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values

#output, just inferring headers
array([[  0.,   1.,   2.,  nan],
   [  3.,   4.,   5.,  nan],
   [  6.,   7.,   8.,  nan],
   [  9.,  10.,  11.,  nan],
   [ 12.,  13.,  14.,  nan],
   [ 15.,  16.,  17.,  nan],
   [ 18.,  19.,  20.,  nan],
   [ 21.,  22.,  23.,  nan],
   [ 24.,  25.,  26.,  nan],
   [ 27.,  28.,  29.,  nan]])

The nan column comes from interpreting the commented line as a header (which it is, though it is also commented out). That line has four whitespace-separated parts ('#', 'makes', 'no', 'sense'), one more than each data row.
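The field count is easy to check by hand. A minimal sketch, assuming the parser splits on whitespace as the output above suggests (the variable names are illustrative):

```python
# The header line as written by np.savetxt with its default comments='# '.
header_line = "# makes no sense"
# One data row in the format shown in the question's output.
data_line = "0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00"

header_fields = header_line.split()
data_fields = data_line.split()

# The header defines four columns, so every three-field data row
# is padded with one missing value.
padding = len(header_fields) - len(data_fields)
print(header_fields, len(data_fields), padding)
```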

You can get rid of the comment mark on the header by changing how you save the text.

# comments='' writes the header line without the leading '# '
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense", comments='')
reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values
#output, without true header commented out
array([[  0.,   1.,   2.],
   [  3.,   4.,   5.],
   [  6.,   7.,   8.],
   [  9.,  10.,  11.],
   [ 12.,  13.,  14.],
   [ 15.,  16.,  17.],
   [ 18.,  19.,  20.],
   [ 21.,  22.,  23.],
   [ 24.,  25.,  26.],
   [ 27.,  28.,  29.]])

This eliminates the problem with the commented-out header, but it doesn't help to infer the correct shape, nor does it help if you have real comments that you also want to ignore.

If you want to infer whether there is a header and also ignore any commented lines, the only way I could figure out is to call a function (pandas.read_csv) rather than the class.

import pandas
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a comment, so shape is decided by first data line
array([[ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01'],
   [ '3.000000000000000000e+01 3.100000000000000000e+01 3.200000000000000000e+01']], dtype=object)

#Or, without the commented out header
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense",comments='')
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a header to determine shape, but comments would also be ignored
array([[ '0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00'],
   [ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01']], dtype=object)
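Both outputs above come back as a single column of strings because pandas.read_csv splits on commas by default, while np.savetxt writes space-delimited rows. The following is a rough sketch of one way to combine the ideas, assuming a recent pandas, whitespace-delimited numeric data, and a header (if present) that contains at least one non-numeric field. The helpers first_line_is_header, read_in_chunks and write_temp are hypothetical names introduced here; they are not part of pandas or of the original answer.

```python
import os
import tempfile

import pandas as pd


def first_line_is_header(path, comment="#"):
    """Hypothetical helper: True if the first non-blank, non-comment line is not all numeric."""
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            if not stripped or stripped.startswith(comment):
                continue
            try:
                for token in stripped.split():
                    float(token)
            except ValueError:
                return True   # at least one field is not a number
            return False      # every field parsed as a number
    return False              # empty file or comments only


def read_in_chunks(path, chunksize=10, comment="#"):
    """Hypothetical wrapper: pick header=0 or header=None, then read in chunks."""
    header = 0 if first_line_is_header(path, comment) else None
    return pd.read_csv(path, sep=r"\s+", comment=comment,
                       header=header, chunksize=chunksize)


def write_temp(text):
    """Write text to a temporary file and return its path (demo only)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    return path


# Three variants of the same 100x3 data: commented header, plain header, no header.
rows = "".join("%d %d %d\n" % (3 * i, 3 * i + 1, 3 * i + 2) for i in range(100))
variants = {
    "commented": "# makes no sense\n" + rows,
    "plain": "a b c\n" + rows,
    "none": rows,
}

first_chunks = {}
for name, text in variants.items():
    path = write_temp(text)
    reader = read_in_chunks(path)
    first_chunks[name] = reader.get_chunk().to_numpy()
    reader.close()
    os.remove(path)
```

In this sketch every variant should yield the same first chunk of ten numeric rows with three columns. The header check is only a heuristic: a header made up entirely of numeric labels would be mistaken for data.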
