Python readlines() usage and efficient practice for reading


Problem description

I have a problem parsing thousands of text files (around 3000 lines in each file, each ~400 KB in size) in a folder. I read them using readlines:

import os
import gzip

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(os.path.join(input_dir, filename), 'rb')
    else:
        f = open(os.path.join(input_dir, filename), 'rb')

    file_content = f.readlines()
    f.close()

    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for samples from my input (50-100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I planned to do a performance analysis and did a cProfile analysis. The time taken grows exponentially with more files, reaching the worst rates when the input got to 7K files.

Here is the cumulative time taken by readlines, first -> 354 files (sample from input) and second -> 7473 files (whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 354    0.192    0.001    0.192    0.001 {method 'readlines' of 'file' objects}
 7473 1329.380    0.178  1329.380    0.178 {method 'readlines' of 'file' objects}
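
(A cumulative profile like the one above can be collected with the standard cProfile and pstats modules; the main() below is just a hypothetical stand-in for whatever function drives the parsing.)

import cProfile
import pstats

# Profile a hypothetical main() entry point, dump the stats to a file, and
# print the ten entries with the largest cumulative time, which is the
# column shown in the readlines rows above.
cProfile.run('main()', 'parse.prof')
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)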

Because of this, the time taken by my code does not scale linearly as the input increases. I read some doc notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of the file currently being processed? But there is some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or is it my wrong interpretation of the Python garbage collector? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

Recommended answer

The short version is: The efficient way to use readlines() is to not use it. Ever.

I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, and parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.

On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.

You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in.

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass
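
If you want to pick that buffer size yourself, the built-in open() in Python 3 (or io.open() in Python 2) takes a buffering argument; this is a small sketch of that idea, not something from the original answer:

# Same one-line-at-a-time loop, but with an explicit 1 MiB read buffer
# instead of whatever default Python picks.
with open('foo', 'rb', buffering=1 << 20) as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass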

And this will do the exact same thing as the readline() loop above:

with open('foo') as f:
    for line in f:
        pass
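
The third option, mmap, doesn't get an example above, so here is a minimal sketch of what it could look like, assuming an uncompressed file named foo (mmap won't help with the gzipped files):

import mmap

with open('foo', 'rb') as f:
    # Map the whole file into the address space; the OS pages it in on
    # demand, so nothing is read up front.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # mmap objects have a readline() method, so this yields one line
        # (as bytes) at a time until it returns the empty string at EOF.
        for line in iter(mm.readline, b''):
            pass
    finally:
        mm.close()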


Meanwhile:

but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory has only the contents of the file currently being processed?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.
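
A tiny way to see that behavior, using a hypothetical list subclass only so the reclamation is observable:

class NoisyList(list):
    # The subclass exists only so we can watch CPython reclaim the object.
    def __del__(self):
        print("previous file_content reclaimed")

file_content = NoisyList([b"a\n", b"b\n"])
# Rebinding the name drops the last reference to the old list, and CPython's
# refcounting reclaims it immediately: the message prints right here, not at
# some later GC pass.
file_content = NoisyList([b"c\n"])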

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).
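
If you want to see the memory side of this for yourself, one rough sketch (using the standard tracemalloc module, Python 3.4+, and a hypothetical file foo) is to compare the peak allocation of readlines() against plain iteration:

import tracemalloc

def peak_kib(func):
    # Run func and report the peak traced allocation in KiB.
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024

def slurp():
    with open('foo', 'rb') as f:
        for line in f.readlines():   # whole file materialized as a list
            pass

def stream():
    with open('foo', 'rb') as f:
        for line in f:               # one buffered line at a time
            pass

print(peak_kib(slurp), peak_kib(stream))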

Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as f:
        if filename.endswith(".gz"):
            # wrap the already-open file so gzipped files are decompressed on the fly
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Or, maybe:

import contextlib

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(os.path.join(input_dir, filename), 'rb')
    else:
        f = open(os.path.join(input_dir, filename), 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...
