Python pandas: read file skipping commented lines


Problem Description


I often deal with ascii tables containing a few columns (normally fewer than 10) and up to tens of millions of lines. They look like

176.792 -2.30523 0.430772 32016 1 1 2 
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
...
177.403 -0.657246 0.432905 31152 1 1 1

I have a number of Python scripts that read, manipulate, and save such files. I have always used numpy.loadtxt and numpy.savetxt to do it, but numpy.loadtxt takes at least 5-6 GB of RAM to read a 1 GB ascii file.

Yesterday I discovered pandas, which solved almost all my problems: pandas.read_table together with numpy.savetxt improved the execution speed of my scripts by a factor of 3 or 4, while being very memory efficient.

All was good until I tried to read in a file that contains a few commented lines at the beginning. The doc string (v=0.10.1.dev_f73128e) tells me that line commenting is not supported, but that it will probably come. I think this would be great: I really like the exclusion of line comments in numpy.loadtxt. Is there any idea on how this will become available? It would also be nice to have the possibility to skip those lines (the doc states that they would be returned as empty).

Not knowing how many comment lines I have in my files (I process thousands of them, coming from different people), for now I open the file and count the number of lines at the beginning that start with a comment character:

import re

def n_comments(fname, comment):
    # count the leading lines that start with the comment character
    with open(fname, 'r') as f:
        n_lines = 0
        pattern = re.compile(r"^\s*{0}".format(comment))
        for l in f:
            if pattern.search(l) is None:
                break
            else:
                n_lines += 1
    return n_lines

and then

pandas.read_table(fname, skiprows=n_comments(fname, '#'), header=None, sep=r'\s+')
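A single-pass variant of the same workaround is also possible (a sketch; `read_skipping_comments` is a hypothetical helper name, not from the post): consume the leading comment lines with `readline`, then hand the rest of the already-open file handle to pandas, so the file is only opened and scanned once.

```python
import pandas

def read_skipping_comments(fname, comment='#'):
    # Hypothetical helper: skip leading comment lines, then let pandas
    # parse the remainder of the same file handle in a single pass.
    with open(fname) as f:
        pos = f.tell()
        line = f.readline()
        while line and line.lstrip().startswith(comment):
            pos = f.tell()
            line = f.readline()
        f.seek(pos)  # rewind to the first data line
        return pandas.read_table(f, header=None, sep=r'\s+')
```

This avoids reading the header region twice, which matters when the files are opened over a slow filesystem.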

Is there any better way (maybe within pandas) to do it?

Finally, before posting, I looked a bit at the code in pandas.io.parsers.py to understand how pandas.read_table works under the hood, but I got lost. Can anyone point me to the places that implement the reading of the files?

Thanks

EDIT2: I thought I could get some improvement by getting rid of some of the ifs in @ThorstenKranz's second implementation of FileWrapper, but got almost no improvement:

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        self._next = self._next_comment

    def next(self):
        return self._next()

    def _next_comment(self):
        # skip leading comment lines, then switch to the fast path
        while True:
            line = super(FileWrapper, self).next()
            if not line.lstrip().startswith(self._comment_literal):
                self._next = self._next_no_comment
                return line

    def _next_no_comment(self):
        return super(FileWrapper, self).next()
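In Python 3 the `file` builtin no longer exists, so the same state-machine trick can be sketched as a plain iterator wrapper around any open file object (the class name is illustrative):

```python
class CommentSkippingLines:
    # Sketch of the same idea for Python 3: once the first data line
    # is seen, swap in a fast path that no longer checks for comments.
    def __init__(self, f, comment_literal='#'):
        self._f = f
        self._comment_literal = comment_literal
        self._next = self._next_comment

    def __iter__(self):
        return self

    def __next__(self):
        return self._next()

    def _next_comment(self):
        # consume leading comment lines, return the first data line
        for line in self._f:
            if not line.lstrip().startswith(self._comment_literal):
                self._next = self._next_no_comment
                return line
        raise StopIteration

    def _next_no_comment(self):
        return next(self._f)
```

As in the original, only the comment lines before the first data line are skipped; later lines pass through untouched.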

Solution

read_csv and read_table have a comment option that will skip bytes starting from a comment character until the end of a line. If an entire line needs to be skipped, this isn't quite right because the parser will think that it's seen a line with no fields in it, then eventually see a valid data line and get confused.
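For illustration, the in-line stripping that the comment option does handle can be sketched like this (sample values invented, not from the post's data):

```python
import io
import pandas as pd

# Invented sample: a trailing comment on a data line is stripped by
# the `comment` option; the fields before '#' still parse normally.
raw = ("176.792 -2.30523 0.430772  # trailing note\n"
       "177.042 -1.87729 0.430562\n")
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', comment='#', header=None)
```

(In later pandas releases this was improved: per the current read_csv docs, a comment character found at the beginning of a line causes the whole line to be ignored.)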

I'd suggest using your workaround to determine the number of rows to skip manually in the file. It would be nice to have an option that enables automatically skipping lines when the entire line is a comment:

https://github.com/pydata/pandas/issues/2685

Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.
