Performance issue with reading integers from a binary file at specific locations


I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need the values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I created before.

import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values

For comparison here is the F# version:

open System
open System.IO

let readValue (reader:BinaryReader) cellIndex = 
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices = 
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values

Any tips on how to improve the performance of the Python version, e.g. by using numpy?

Update

HDF5 works very well (from 5 seconds down to 0.8 seconds on my test file):

import tables

def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]
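For completeness, such a file could be produced from the raw binary with a few lines of PyTables; `convert_to_hdf5` is a hypothetical helper (not from the original post), but the dataset name `raster` matches what `read_values_hdf5` above expects:

```python
import numpy as np
import tables

def convert_to_hdf5(raw_filename, hdf5_filename):
    """Hypothetical helper: pack a flat file of native 32-bit ints
    into an HDF5 file with a single array named 'raster', the
    dataset that read_values_hdf5 above reads."""
    data = np.fromfile(raw_filename, dtype=np.int32)
    with tables.open_file(hdf5_filename, mode='w') as f:
        f.create_array(f.root, 'raster', data)
```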

Update 2

I went with np.memmap because its performance is similar to HDF5 and I already have numpy in production.

Solution

Depending on the size of your index file, you may want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.

One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, the use of unbuffered IO would be a good idea, as you only need a few values.
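In CPython the unbuffered route can at least be approximated: passing `buffering=0` to `open` returns a raw binary stream, so each `read(4)` becomes a direct OS read with no read-ahead buffer dragged along by the seeks. A sketch (same shape as the question's `read_values`; not benchmarked here):

```python
import os
import struct

def read_values_unbuffered(filename, indices):
    """Like the question's read_values, but on an unbuffered stream:
    buffering=0 disables Python's buffering layer entirely."""
    values = []
    with open(filename, 'rb', buffering=0) as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            values.append(struct.unpack('@i', f.read(4))[0])
    return values
```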

import numpy as np

def read_values_fromfile(filename, indices):
    # read the complete index into memory
    index_array = np.fromfile(filename, dtype=np.uint32)
    # look up the indices you need (indices being a list of indices)
    return index_array[indices]

If you would read almost all of the pages anyway (i.e. your indices are random and occur at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file and only want to pick a few indices, it is not so fast.

Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.

It should be something like this:

import mmap
import struct

with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the value at index i:
        idx_value = struct.unpack('I', memory_map[4 * i:4 * i + 4])[0]

(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
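On the endianness point: the `'I'` format uses the machine's native byte order, so a file written on a machine of the other endianness would be misread. `struct` can pin the byte order explicitly with `<` (little-endian) and `>` (big-endian); a small demonstration:

```python
import struct

# '<i' and '>i' fix little- and big-endian 32-bit ints regardless of host;
# '@i' (used in the question) takes whatever the machine's native order is.
raw = struct.pack('<i', 1)                  # always b'\x01\x00\x00\x00'
assert raw == b'\x01\x00\x00\x00'
assert struct.unpack('<i', raw)[0] == 1

# The same four bytes read as big-endian give a different value:
assert struct.unpack('>i', raw)[0] == 0x01000000
```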

Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:

import numpy as np

def read_values_memmap(filename, indices):
    # note: valid memmap modes are 'r', 'r+', 'w+' and 'c'; 'r' is read-only
    index_arr = np.memmap(filename, dtype='uint32', mode='r')
    return index_arr[indices]

I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
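A quick way to convince yourself the memmap route returns the same values as the explicit seek/unpack loop is a self-contained round trip on a generated file (a sketch with synthetic data, not the asker's file):

```python
import os
import struct
import tempfile

import numpy as np

# Build a small test file of unsigned 32-bit ints, then check that
# np.memmap indexing matches explicit seek/unpack reads.
fd, filename = tempfile.mkstemp()
os.close(fd)
np.arange(1000, dtype=np.uint32).tofile(filename)

indices = [0, 17, 500, 999]

mapped = np.memmap(filename, dtype=np.uint32, mode='r')
memmap_values = [int(mapped[i]) for i in indices]
del mapped  # release the map before deleting the file

with open(filename, 'rb') as f:
    seek_values = []
    for i in indices:
        f.seek(i * 4, os.SEEK_SET)
        seek_values.append(struct.unpack('=I', f.read(4))[0])

assert memmap_values == seek_values == [0, 17, 500, 999]
os.unlink(filename)
```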


EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.

What is mmap?

Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.

File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.

The delay in random file access is mostly due to the physical rotation of the disks (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms, plus any data-handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.

If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer it before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need them and remain in the cache for further use. (This could in principle happen with fseek as well, because it might use the same technology behind the scenes. However, there is no guarantee, and the call anyway incurs some overhead as it wanders through the operating system.)
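Since Python 3.8 the access pattern can even be hinted to the kernel from Python: `mmap.madvise(mmap.MADV_RANDOM)` asks it to skip read-ahead, which suits exactly the scattered index lookups discussed above. A sketch (the constant is POSIX-specific, so it is guarded for portability):

```python
import mmap
import os
import struct
import tempfile

# Make a throwaway file of 100 little-endian uint32 values.
fd, path = tempfile.mkstemp()
os.write(fd, struct.pack('<100I', *range(100)))
os.close(fd)

with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise()/MADV_RANDOM exist on POSIX systems with Python 3.8+.
    if hasattr(mm, 'madvise') and hasattr(mmap, 'MADV_RANDOM'):
        mm.madvise(mmap.MADV_RANDOM)  # hint: access pattern is random
    value = struct.unpack('<I', mm[4 * 42:4 * 42 + 4])[0]
    mm.close()
os.unlink(path)
```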

mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
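The no-file variant mentioned above looks like this in Python: `mmap.mmap(-1, length)` creates an anonymous map which, on Unix, is shared with children created by `fork()`. A minimal sketch of using it for inter-process communication (Unix-only because of `os.fork`):

```python
import mmap
import os
import struct

# Anonymous memory map (no file behind it), shared across fork():
# the child writes an integer, the parent reads it back.
shared = mmap.mmap(-1, 4)          # 4 anonymous shared bytes

pid = os.fork()
if pid == 0:                       # child process
    shared[:4] = struct.pack('<i', 1234)
    os._exit(0)

os.waitpid(pid, 0)                 # wait for the child to finish
value = struct.unpack('<i', shared[:4])[0]
shared.close()
```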

mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
