Python readlines() usage and efficient practice for reading


Question

I have a problem parsing thousands of text files in a folder (around 3000 lines in each file, each about 400 KB in size). I read them using readlines:

import os
import gzip

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')

    file_content = f.readlines()
    f.close()

    i = 0
    len_file = len(file_content)
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for a sample of my input (50 to 100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I planned to do a performance analysis and ran a cProfile analysis. The time taken increases exponentially with more files, reaching its worst rates when the input got to 7K files.
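
For reference, a profile like the one shown below could be captured with the standard cProfile module, roughly like this (parse_all is a hypothetical wrapper around the loop above, not part of the original code):

import cProfile
import pstats

# parse_all is assumed to wrap the parsing loop shown above.
cProfile.run('parse_all(input_dir)', 'parse.prof')

stats = pstats.Stats('parse.prof')
stats.sort_stats('cumulative').print_stats('readlines')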

Here is the cumulative time taken by readlines, first for 354 files (a sample of the input) and second for 7473 files (the whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 354    0.192    0.001    **0.192**    0.001 {method 'readlines' of 'file' objects}
 7473 1329.380    0.178  **1329.380**    0.178 {method 'readlines' of 'file' objects}

Because of this, the time taken by my code does not scale linearly as the input increases. I read some doc notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear the loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file I am currently processing? But there seems to be some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or is it my wrong interpretation of the Python garbage collector? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

Answer

The short version is: the efficient way to use readlines() is to not use it. Ever.

I read some doc notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, parses it into lines, and builds a list of str objects out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory and builds a str, so that doesn't help.

On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.

You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in (a sketch of this option appears after the examples below).

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time, although Python is allowed to (and will) pick a nice buffer size to make things faster:

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
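
Option 3, mmap, isn't shown above. Here's a minimal, Python 3 flavored sketch, assuming the file is opened in binary mode and is newline-delimited; the OS pages the data in lazily, so nothing is read until it is actually touched:

import mmap

with open('foo', 'rb') as f:
    # Map the whole file into the address space without reading it eagerly.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # mmap objects have a readline() that returns b'' at EOF.
        for line in iter(m.readline, b''):
            pass
    finally:
        m.close()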


Meanwhile:

but shouldn't the garbage collector automatically clear the loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file I am currently processing?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the free list, meaning the same memory can be reused again for your next pass.
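
As a minimal illustration of that refcounting behavior (Probe is a hypothetical stand-in for a big readlines() result, used only to make the deallocation visible):

class Probe:
    def __init__(self, name):
        self.name = name
    def __del__(self):
        print("freed:", self.name)

content = [Probe("pass 1")]   # stands in for file_content on the first pass
content = [Probe("pass 2")]   # rebinding drops the old list's refcount to zero;
                              # on CPython "freed: pass 1" prints immediately
content = None                # "freed: pass 2"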

However, all those allocations, copies, and deallocations aren't free; it's much faster not to do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).
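
If you want to see the difference yourself, a rough check with the standard tracemalloc module might look like this (slurp and stream are hypothetical helpers, not part of the original code):

import tracemalloc

def peak_kb(fn, *args):
    """Return the peak memory (in KB) Python allocated while running fn."""
    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024

def slurp(path):      # readlines(): the whole file is in memory at once
    with open(path, 'rb') as f:
        for line in f.readlines():
            pass

def stream(path):     # lazy iteration: only one buffered chunk at a time
    with open(path, 'rb') as f:
        for line in f:
            pass

# print(peak_kb(slurp, 'foo'), peak_kb(stream, 'foo'))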

Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    with open(path, 'rb') as f:
        if filename.endswith(".gz"):
            # Wrap the already-open file object in a gzip reader.
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Or, maybe:

import contextlib

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...
