What is the difference between the buffering argument to open() and the hardcoded readahead buffer size used when iterating through a file?


Problem description

Inspired by this question, I'm wondering exactly what the optional buffering argument to Python's open() function does. From looking at the source, I see that buffering is passed into setvbuf to set the buffer size for the stream (and that it does nothing on a system without setvbuf, which the docs confirm).

However, when you iterate over a file, there is a constant called READAHEAD_BUFSIZE that appears to define how much data is read at a time (this constant is defined here).

My question is exactly how the buffering argument relates to READAHEAD_BUFSIZE. When I iterate through a file, which one defines how much data is being read off disk at a time? And is there a place in the C source that makes this clear?

Solution

READAHEAD_BUFSIZE is only used when you use the file as an iterator:

for line in fileobj:
    print line

It is separate from the buffer controlled by the buffering argument, which the C stdio layer manages and fread() consults on every call. Both buffers are in play when iterating.

From file.next():

In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer. As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right. However, using seek() to reposition the file to an absolute position will flush the read-ahead buffer.

The buffer size set through setvbuf is not changed: setvbuf is called once, when the file is opened, and the file iteration code never touches it. Instead, the iteration code calls Py_UniversalNewlineFread (which uses fread) to fill the read-ahead buffer, creating a second buffer internal to Python. Python otherwise leaves the regular buffering to the C library: fread() calls are buffered, and fread() consults its userspace buffer to satisfy each request without Python having to do anything.
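The idea of a library-level buffer sitting between small reads and the layer below can be illustrated by analogy. The sketch below does not use C stdio or the Python 2 file object discussed here; it assumes Python 3's io.BufferedReader as a stand-in for the fread() buffer, and CountingRaw is a hypothetical helper invented for this illustration.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts how often the layer below is asked for data."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Every call here stands for one request that the buffer could not satisfy.
        self.calls += 1
        return self._inner.readinto(b)


raw = CountingRaw(b"x" * 10000)
# buffer_size plays the role that the setvbuf size plays for a stdio stream.
reader = io.BufferedReader(raw, buffer_size=4096)

small_reads = 0
while reader.read(1):  # many one-byte requests, as a caller of fread() might make
    small_reads += 1

print(small_reads, raw.calls)
```

The one-byte requests are almost all answered from the in-memory buffer, so the raw stream is consulted far less often than the caller reads.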

readahead_get_line_skip() then serves newline-terminated lines from this buffer. If the buffer no longer contains a newline, it refills the buffer by calling itself recursively with a buffer size 1.25 times the previous value. This means that file iteration can read the whole rest of the file into memory if the rest of the file contains no more newline characters!
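The refill behaviour described above can be sketched in pure Python. This is a simulation written for illustration, not the CPython source: iter_lines_with_readahead is a hypothetical name, the base size of 8192 is assumed from the first tell() value in this answer's output, and the growth step (size plus a quarter of size) is simply one way to express the 1.25 factor.

```python
import io

READAHEAD_BUFSIZE = 8192  # assumed base size for this sketch


def iter_lines_with_readahead(stream, base_size=READAHEAD_BUFSIZE):
    """Yield (line, stream_position) pairs using a read-ahead buffer that grows by 1.25x."""
    buf = b""
    while True:
        size = base_size  # every new line lookup starts at the base size
        parts = []
        while True:
            if not buf:
                buf = stream.read(size)
                if not buf:  # end of file
                    break
            newline = buf.find(b"\n")
            if newline != -1:
                # Serve one line and keep the rest of the buffer for later lines.
                parts.append(buf[:newline + 1])
                buf = buf[newline + 1:]
                break
            # No newline left: keep the partial line and refill with a larger buffer.
            parts.append(buf)
            buf = b""
            size += size >> 2  # 1.25 times the previous size
        line = b"".join(parts)
        if not line:
            return
        yield line, stream.tell()


if __name__ == "__main__":
    data = (b"a" * 99 + b"\n") * 300  # 300 lines of 100 bytes each
    seen = []
    for _, position in iter_lines_with_readahead(io.BytesIO(data)):
        if position not in seen:
            seen.append(position)
    print(seen)  # distinct stream positions observed while iterating
```

As in the description, the stream position jumps ahead in buffer-sized steps while many lines are served from memory, and a line that straddles a buffer boundary triggers a larger refill.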

To see how much the buffer reads, print the file position (using fileobj.tell()) while looping:

>>> with open('test.txt') as f:
...     for line in f:
...         print f.tell()
... 
8192   # 1 times the buffer size
8192
8192
~ lines elided
18432  # + 1.25 times the buffer size
18432
18432
~ lines elided
26624  # + 1 times the buffer size; the last newline must've aligned on the buffer boundary
26624
26624
~ lines elided
36864  # + 1.25 times the buffer size
36864
36864

etc.

Which bytes are actually read from the disk (provided fileobj is an actual physical file on your disk) depends not only on the interplay between the fread() buffer and the internal read-ahead buffer, but also on whether the OS itself is using buffering. It could well be that even when the file buffer is exhausted, the OS serves the read system call from its own cache instead of going to the physical disk.
