Linux/Perl mmap performance


Question

I'm trying to optimize the handling of large datasets using mmap. A dataset is in the gigabyte range. The idea was to mmap the whole file into memory, allowing multiple processes to work on the dataset concurrently (read-only). It isn't working as expected, though.

As a simple test, I just mmap the file (using Perl's Sys::Mmap module and its "mmap" sub, which I believe maps directly to the underlying C function) and have the process sleep. When doing this, the code spends more than a minute before it returns from the mmap call, even though the test does nothing with the mmap'ed file, not even a read.

My guess was that Linux might require the whole file to be read when it is first mmap'ed, so after the file had been mapped in the first process (while it was sleeping), I ran a simple test in another process that tried to read the first few megabytes of the file.

Surprisingly, the second process also spends a lot of time before returning from the mmap call, about as long as mmap'ing the file took the first time.

I've made sure that MAP_SHARED is being used and that the process that mapped the file the first time is still active (it has not terminated, and the mapping has not been unmapped).

I expected that an mmap'ed file would let me give multiple worker processes efficient random access to the large file, but if every mmap call requires reading the whole file first, that becomes a bit harder. I haven't tested with long-running processes to see whether access is fast after the initial delay, but I expected that using MAP_SHARED and another separate process would be sufficient.

My theory was that mmap would return more or less immediately and that Linux would load the blocks more or less on demand, but the behaviour I am seeing is the opposite, which suggests it reads through the whole file on each call to mmap.

Any idea what I'm doing wrong, or have I completely misunderstood how mmap is supposed to work?

Recommended answer

OK, found the problem. As suspected, neither Linux nor Perl was to blame. To open and access the file I do something like this:

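#!/usr/bin/perl
# Create a 1 GB test file if you do not have one:
# dd if=/dev/urandom of=test.bin bs=1048576 count=1000
use strict; use warnings;
use Sys::Mmap;

# Three-argument open keeps the mode separate from the file name.
open (my $fh, "<", "test.bin")
    or die "open: $!";

my $t = time;
print STDERR "mmapping.. ";
# A length of 0 maps the whole file; $mh becomes a scalar backed by the mapping.
mmap (my $mh, 0, PROT_READ, MAP_SHARED, $fh)
    or die "mmap: $!";
# Read only the first 1024 bytes, so only the start of the file is touched.
my $str = unpack ("A1024", substr ($mh, 0, 1024));
print STDERR " ", time-$t, " seconds\nsleeping..";

# Keep the process, and with it the mapping, alive for an hour.
sleep (60*60);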

If you test that code, there are no delays like those I saw in my original code, and after creating the minimal sample (always do that, right!) the reason suddenly became obvious.

The error was that my code treated the $mh scalar as a handle, something lightweight that can be moved around easily (read: passed by value). It turns out that it is actually a gigabyte-long string, definitely not something you want to move around without creating an explicit reference (Perl lingo for a "pointer"/handle value). Each such copy has to read every byte of the mapping, which is where the delay came from. So if you need to store it in a hash or similar, make sure you store \$mh, and dereference it when you need to use it, as in ${$hash->{mh}}, typically as the first parameter to substr or similar.
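
As a minimal sketch of the difference, the following uses an ordinary 16 MB string as a stand-in for the mmap'ed scalar. This is an assumption made so the example runs without Sys::Mmap or a test file, and the names %by_value, %by_ref and first_bytes are hypothetical. Assigning the scalar into a hash copies the whole string, while storing \$mh stores only a reference to it.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Stand-in for the mmap'ed scalar (assumption): with Sys::Mmap,
# $mh would be filled in by the mmap() call instead.
my $mh = "x" x (16 * 1024 * 1024);
substr ($mh, 0, 5) = "HELLO";

# By value: the hash entry is a second, independent 16 MB string.
my %by_value = (mh => $mh);

# By reference: the hash entry is a small reference to the original scalar.
my %by_ref = (mh => \$mh);

# Hypothetical helper: takes a reference and returns only the requested
# slice, so the large string itself is never copied.
sub first_bytes {
    my ($ref, $len) = @_;
    return substr (${$ref}, 0, $len);
}

my $head = first_bytes ($by_ref{mh}, 5);
print "head via reference: $head\n";

# Comparing references numerically compares addresses: the reference entry
# points at $mh itself, while the by-value entry lives in its own storage.
print "by_ref points at \$mh:   ", ($by_ref{mh} == \$mh ? "yes" : "no"), "\n";
print "by_value points at \$mh: ", (\$by_value{mh} == \$mh ? "yes" : "no"), "\n";
```

The same applies inside subroutines: an assignment such as my ($data) = @_; copies the string, so a function that works on the mapped data is better off taking a reference, as first_bytes does here.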
