Preferred block size when reading/writing big binary files


Problem Description

I need to read and write huge binary files. Is there a preferred or even optimal number of bytes (what I call BLOCK_SIZE) I should read() at a time?

One byte is certainly too little, and I do not think reading 4 GB into RAM is a good idea either - is there a 'best' block size? Or does that even depend on the file system (I'm on ext4)? What do I need to consider?
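
One data point worth gathering before guessing: on Unix-like systems, os.stat() reports the block size the filesystem prefers for I/O on a given file. A minimal sketch (st_blksize is not available on Windows; 'in-0.data' is just the file from the sample code below):

import os

# st_blksize is the filesystem's preferred I/O block size for this file;
# on a typical ext4 volume this reports 4096.
st = os.stat('in-0.data')
print(st.st_blksize)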

Python's open() even provides a buffering argument. Would I need to tweak that as well?
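
For reference, buffering takes a byte count when the file is opened in binary mode; omitting it gives io.DEFAULT_BUFFER_SIZE. A sketch with an arbitrary 1 MiB buffer, purely for illustration, not a recommendation:

import io

print(io.DEFAULT_BUFFER_SIZE)  # typically 8192; used when buffering is omitted

with open('in-0.data', 'rb', buffering=1024 * 1024) as f:  # arbitrary 1 MiB buffer
    first_block = f.read(8192)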

This is sample code that just joins the two files in-0.data and in-1.data into out.data (in real life there is more processing that is irrelevant to the question at hand). BLOCK_SIZE is chosen equal to io.DEFAULT_BUFFER_SIZE, which seems to be the default for buffering:

from pathlib import Path
from functools import partial

DATA_PATH = Path(__file__).parent / '../data/'

out_path = DATA_PATH / 'out.data'
in_paths = (DATA_PATH / 'in-0.data', DATA_PATH / 'in-1.data')

BLOCK_SIZE = 8192  # equal to io.DEFAULT_BUFFER_SIZE

def process(data):
    pass  # placeholder for the real processing

with out_path.open('wb') as out_file:
    for in_path in in_paths:
        with in_path.open('rb') as in_file:
            # iter() with a sentinel calls in_file.read(BLOCK_SIZE)
            # repeatedly until it returns b'' at end of file
            for data in iter(partial(in_file.read, BLOCK_SIZE), b''):
                process(data)
                out_file.write(data)
#            # equivalent explicit loop:
#            while True:
#                data = in_file.read(BLOCK_SIZE)
#                if not data:
#                    break
#                process(data)
#                out_file.write(data)

Recommended Answer

Let the OS make the decision for you. Use the mmap module:

https://docs.python.org/3.4/library/mmap.html

It uses your OS's underlying memory-mapping mechanism to map the contents of a file into RAM.

Be aware that there's a 2GB file size limit if you're using 32-bit Python, so be sure to use the 64-bit version if you decide to go this route.

For example:

import mmap

f1 = open('input_file', 'r+b')
m1 = mmap.mmap(f1.fileno(), 0)  # map the whole input file
f2 = open('out_file', 'a+b')  # out_file must be >0 bytes on Windows
m2 = mmap.mmap(f2.fileno(), 0)
m2.resize(len(m1))  # grow the output mapping to the input's size
m2[:] = m1  # copy input_file to out_file
m2.flush()  # flush results to disk

Note that you never had to call any read() function or decide how many bytes to bring into RAM. This example just copies one file into another, but as you said in your example, you can do whatever processing you need in between. Note that while the entire file is mapped to an address space in RAM, that doesn't mean it has actually been copied there. It will be copied piecewise, at the discretion of the OS.
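
If, as in the question's sample code, the data still needs to be seen in chunks, one possible pattern is to slice the mapping instead of calling read(). A sketch, reusing the question's placeholder process(); the chunk size is now purely a processing choice, not an I/O tuning knob:

import mmap

def process(data):
    pass  # stand-in for the question's real per-chunk processing

BLOCK_SIZE = 8192

with open('input_file', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for offset in range(0, len(m), BLOCK_SIZE):
            # slicing copies only this window out of the mapping;
            # the OS pages the file in as the slices are touched
            process(m[offset:offset + BLOCK_SIZE])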
