Python mmap - slow access to end of files [with test code]

Problem description

I posted a similar question a few days ago, but without any code; now I have created test code in the hope of getting some help.

The code is at the bottom of this post.

I have a dataset with a bunch of large files (~100), and I want to extract specific lines from those files very efficiently (both in memory and in speed).

My code gets a list of the relevant files. It opens each file with [line 1], then maps the file to memory with [line 2]. For each file I also receive a list of indices; iterating over those indices, I retrieve the relevant information (10 bytes in this example) as in [lines 3-4], and finally I close the handles with [lines 5-6].

binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
    information = binaryFile_mm[INDEX:INDEX + 10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()

This code runs in parallel, with thousands of indices for each file, and it does that continuously, several times a second, for hours.

Now to the problem: the code runs well when I limit the indices to small values (meaning, when I ask the code to get information from the beginning of the file). But when I increase the range of the indices, everything slows down to (almost) a halt and the buff/cache memory fills up (I'm not sure whether the memory issue is related to the slowdown).

So my question is: why does it matter whether I retrieve information from the beginning or the end of the file, and how do I overcome this in order to get instant access to information from the end of the file without slowing down and without increasing buff/cache memory use?

PS - some numbers and sizes: I have ~100 files, each about 1 GB in size. When I limit the indices to the first 0%-10% of a file it runs fine, but when I allow an index to be anywhere in the file it stops working.

Code - tested on Linux and Windows with Python 3.5; requires 10 GB of storage (it creates 3 files with random strings inside, 3 GB each):

import os, errno, sys
import random, time
import mmap



def create_binary_test_file():
    print("Creating files with 3,000,000,000 characters, takes a few seconds...")
    test_binary_file1 = open("test_binary_file1.testbin", "wb")
    test_binary_file2 = open("test_binary_file2.testbin", "wb")
    test_binary_file3 = open("test_binary_file3.testbin", "wb")
    # translation table mapping every byte value to a lowercase letter,
    # so os.urandom output becomes a random a-z string
    tbl = bytes.maketrans(bytearray(range(256)),
                          bytearray([ord(b'a') + b % 26 for b in range(256)]))
    for i in range(1000):
        if i % 100 == 0:
            print("progress - ", i / 10, " % ")
        # efficiently create a 3 MB random string and write it to all three files
        random_string = os.urandom(3000000).translate(tbl)
        test_binary_file1.write(random_string)
        test_binary_file2.write(random_string)
        test_binary_file3.write(random_string)
    test_binary_file1.close()
    test_binary_file2.close()
    test_binary_file3.close()
    print("Created binary files for testing. Each file contains 3,000,000,000 characters.")




# Opening binary test file
try:
    binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e:
    if e.errno == errno.ENOENT:  # no such file or directory - create the test files first
        create_binary_test_file()
        binary_file = open("test_binary_file1.testbin", "r+b")
    else:
        raise




## example of use - perform 100 times, in each iteration: open one of the binary files and retrieve 50,000 sample strings
## (if the code runs fast and without a slowdown - increase k or the other numbers and it should reproduce the problem)

## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,100000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()


## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,3000000000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()

My results: (the average time of getting information from anywhere in the file is almost 4 times slower than getting it from the beginning; with ~100 files and parallel computing this difference gets much bigger)

Getting information from start of file
Iter  9         Average time -  0.14790
Iter  19        Average time -  0.14590
Iter  29        Average time -  0.14456
Iter  39        Average time -  0.14279
Iter  49        Average time -  0.14256
Iter  59        Average time -  0.14312
Iter  69        Average time -  0.14145
Iter  79        Average time -  0.13867
Iter  89        Average time -  0.14079
Iter  99        Average time -  0.13979
Getting information from all of the file
Iter  9         Average time -  0.46114
Iter  19        Average time -  0.47547
Iter  29        Average time -  0.47936
Iter  39        Average time -  0.47469
Iter  49        Average time -  0.47158
Iter  59        Average time -  0.47114
Iter  69        Average time -  0.47247
Iter  79        Average time -  0.47881
Iter  89        Average time -  0.47792
Iter  99        Average time -  0.47681

Answer

To determine whether you're getting adequate performance, check the memory available for the buffer/page cache (free in Linux), the I/O stats - the number of reads, their size and duration (iostat; compare with the specs of your hardware) - and the CPU utilization of your process.
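
A minimal, hedged sketch of collecting those numbers from inside the test script instead of from a separate terminal; it assumes the third-party psutil package is installed, and the cached field is Linux-specific (it falls back to 0 elsewhere):

import time
import psutil

def snapshot():
    vm = psutil.virtual_memory()        # includes page-cache figures on Linux
    io = psutil.disk_io_counters()      # cumulative read counters since boot
    return vm.available, getattr(vm, "cached", 0), io.read_count, io.read_bytes

before = snapshot()
time.sleep(10)                          # replace with one batch of your lookups
after = snapshot()

print("available RAM delta (MB):", (after[0] - before[0]) / 1e6)
print("page cache delta (MB):", (after[1] - before[1]) / 1e6)
print("reads issued:", after[2] - before[2])
print("bytes read (MB):", (after[3] - before[3]) / 1e6)
print("process CPU percent:", psutil.Process().cpu_percent(interval=1.0))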

[edit] Assuming that you read from a locally attached SSD (and the data you need is not already in the cache):

  • When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer, because the 50k accesses of an mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A, I'd use simple seek/read instead (and open the file with buffering=0 to avoid unnecessary reads from Python's buffered I/O); see the seek/read sketch below.
  • With more threads/processes reading simultaneously, you can saturate your SSD throughput (how many 4 KB reads/s it can do - anywhere from 5,000 to 1,000,000), and then the individual reads will become even slower.

[/edit]
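
A minimal sketch of the seek/read approach mentioned above, under the assumption that each lookup needs a fixed number of bytes at a known offset (the function name read_records is hypothetical, not part of the question's code):

def read_records(path, indices, length=10):
    # buffering=0 disables Python's read-ahead, so each lookup issues a single
    # read of exactly the bytes requested (binary mode is required for this)
    results = []
    with open(path, "rb", buffering=0) as f:
        for index in indices:
            f.seek(index)                      # jump straight to the offset
            results.append(f.read(length).decode("utf-8"))
    return results

# usage, mirroring the loop from the question:
# information = read_records(path, INDEXES, length=10)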

The first example only accesses 3*100 KB of the files' data, so since you have much more than that available for the cache, all 300 KB quickly end up in the cache. You'll see no I/O, and your Python process will be CPU-bound.

I'm 99.99% sure that if you tested reading from the last 100 KB of each file, it would perform as well as the first example - it's not about the location of the data, it's about the size of the data that is accessed.

The second example accesses random portions of 9 GB, so you can hope to see similar performance only if you have enough free RAM to cache all 9 GB, and only after you preload the files into the cache, so that the test case runs with zero I/O.
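
One rough way to do that preloading - a hedged sketch that simply streams each test file once so its pages land in the page cache (only worthwhile if the machine really has ~9 GB of RAM to spare; the file names are the ones created by the test code):

def warm_cache(path, chunk_size=64 * 1024 * 1024):
    # read the file sequentially and discard the data; the kernel keeps the
    # pages in the page cache as long as there is enough free memory
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass

for name in ("test_binary_file1.testbin",
             "test_binary_file2.testbin",
             "test_binary_file3.testbin"):
    warm_cache(name)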

In realistic scenarios the files will not be fully in the cache, so you'll see many I/O requests and much lower CPU utilization for Python. As I/O is much slower than cached access, you should expect this example to run slower.
