(Python) Counting lines in a huge (>10GB) file as fast as possible
Question
I have a really simple script right now that counts lines in a text file using enumerate():
i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()
This takes around 3 and a half minutes to go through a 15GB log file with ~30 million lines. It would be great if I could get this under two minutes or less, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB - possibly more than one and a half hours - and we'd like to minimise the time & memory load on the server.
I would also settle for a good approximation/estimation method, but it needs to be accurate to about 4 significant figures...
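On the approximation idea from the question, one possible sketch (the helper name and sample size below are mine, not from the thread): read a sample from the start of the file, compute the average line length, and divide the total file size by it. This only reaches ~4 significant figures if line lengths are reasonably uniform across the whole log.

```python
import os

def approx_line_count(path, sample_size=1 << 20):
    """Estimate the line count by sampling the start of the file
    and extrapolating from the average line length.
    (Hypothetical helper, not from the original post.)"""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return 0
    newlines = sample.count(b"\n")
    if newlines == 0:
        return 1  # sample has no line breaks; treat as one long line
    avg_line_length = len(sample) / newlines
    return round(file_size / avg_line_length)
```

Because it reads only the first megabyte, this runs in near-constant time regardless of file size, but it will drift if line lengths change later in the log.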
Thanks!
Recommended answer
Ignacio's answer is correct, but might fail if you have a 32-bit process.

But maybe it could be useful to read the file block-wise and then count the \n characters in each block.
def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))
will do your job.
Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.
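Conversely, if decoding overhead matters, a binary-mode variant (my own sketch, not part of the answer) can count newline bytes directly. Since b"\n" still occurs inside every \r\n pair, CRLF files count correctly too; only old-style lone-\r line endings would be missed:

```python
def count_lines_binary(path, size=65536):
    """Count newline bytes in fixed-size binary chunks.
    Skips text decoding and universal-newline translation;
    b"\n" matches plain LF and the LF inside CRLF alike."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count
```

Like the block-wise text version, a final line without a trailing newline is not counted, so both approaches agree with each other.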
For Python 3, and to make it more robust, for reading files with all kinds of characters:
def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "r", encoding="utf-8", errors='ignore') as f:
    print(sum(bl.count("\n") for bl in blocks(f)))
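For the monthly analysis mentioned in the question, the same counter can simply be summed over all ~30 daily logs. This sketch assumes the logs are reachable via a glob pattern (the path and naming scheme are illustrative, and blocks is restated so the snippet is self-contained):

```python
import glob

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

def monthly_line_count(pattern="logs/2024-01-*.log"):
    """Sum the line counts of every daily log matching the
    pattern (hypothetical layout, not from the thread)."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            total += sum(bl.count("\n") for bl in blocks(f))
    return total
```

Since each file is read once and discarded block by block, memory stays bounded at one 64KB block no matter how many logs are processed.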