How to get line count of a large file cheaply in Python?

Question

I need to get the line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?

Recommended answer

I had to post this on a similar question until my reputation score jumped a bit (thanks to whoever bumped me!).

All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3, you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered:

def rawcount(filename):
    # Binary mode; f.raw is the unbuffered FileIO object underneath.
    with open(filename, 'rb') as f:
        lines = 0
        buf_size = 1024 * 1024  # read 1 MiB at a time
        read_f = f.raw.read

        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)

    return lines
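
The answer mentions bytearrays, although rawcount itself lets each read() allocate a fresh bytes object. As a rough sketch of what reusing a single buffer could look like (rawcount_readinto is a hypothetical name, not part of the original answer, and it has not been timed against the functions here):

```python
def rawcount_readinto(filename, buf_size=1024 * 1024):
    # Hypothetical variant: one preallocated bytearray is reused for every read.
    buf = bytearray(buf_size)
    lines = 0
    # buffering=0 makes open() return the raw (unbuffered) file object directly.
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buf)  # number of bytes read; 0 at end of file
            if not n:
                break
            # Count only within the part of the buffer that was just filled.
            lines += buf.count(b'\n', 0, n)
    return lines
```

Whether this beats rawcount probably depends on the platform and file size, so it is worth measuring before adopting it.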

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    # Yield 1 MiB chunks until the reader returns b'' at end of file.
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def rawgencount(filename):
    with open(filename, 'rb') as f:
        f_gen = _make_gen(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)

This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking:

from itertools import takewhile, repeat

def rawincount(filename):
    with open(filename, 'rb') as f:
        # Keep reading 1 MiB chunks until read() returns b'' (falsy) at end of file.
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)
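
One thing to watch: these functions count newline bytes, so a file whose last line has no trailing newline comes out one lower than the enumerate loop in the question. A minimal sketch of a correction, assuming the final partial line should be counted (rawcount_lastline is a hypothetical name, not from the original answer):

```python
def rawcount_lastline(filename, buf_size=1024 * 1024):
    lines = 0
    last = b''  # most recent non-empty chunk
    # buffering=0 gives the raw (unbuffered) file object.
    with open(filename, 'rb', buffering=0) as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            lines += buf.count(b'\n')
            last = buf
    # A non-empty file that does not end in a newline has one more line.
    if last and not last.endswith(b'\n'):
        lines += 1
    return lines
```

An empty file still yields 0, and a file that already ends in a newline is counted exactly as before.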

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46
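
The timing tool itself is not shown in the answer. As a rough sketch of how average and minimum times like these could be collected with the standard timeit module (time_counters and its parameters are hypothetical, not the author's tool):

```python
import timeit

def time_counters(funcs, filename, repeat=5, number=10):
    # Hypothetical helper: returns {function name: (average, minimum)},
    # both in seconds per call.
    results = {}
    for func in funcs:
        # Each of the `repeat` runs calls func(filename) `number` times.
        runs = timeit.repeat(lambda: func(filename), repeat=repeat, number=number)
        per_call = [t / number for t in runs]
        results[func.__name__] = (sum(per_call) / len(per_call), min(per_call))
    return results
```

It would be called with a list of counting functions and a path, for example time_counters([rawcount, rawgencount, rawincount], path). Results vary with disk cache state, so the first run on a cold file is usually not representative.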
