Bad Linux Memory Mapped File Performance with Random Access C++ & Python


Problem Description


While trying to use memory mapped files to create a multi-gigabyte file (around 13 GB), I ran into what appears to be a problem with mmap(). The initial implementation was done in C++ on Windows using boost::iostreams::mapped_file_sink, and all was well. The code was then run on Linux, and what took minutes on Windows became hours on Linux.

The two machines are clones of the same hardware: Dell R510 2.4GHz 8M Cache 16GB Ram 1TB Disk PERC H200 Controller.

The Linux machine runs Oracle Enterprise Linux 6.5 with the 3.8 kernel and g++ 4.8.3.

There was some concern that there might be a problem with the boost library, so implementations were also done with boost::interprocess::file_mapping and again with native mmap(). All three show the same behavior: Windows and Linux performance is on par up to a certain point, after which the Linux performance falls off badly.

Full source code and performance numbers are linked below.

// C++ code using boost::iostreams
void IostreamsMapping(size_t rowCount)
{
   std::string outputFileName = "IoStreamsMapping.out";
   boost::iostreams::mapped_file_params params(outputFileName);
   params.new_file_size = static_cast<boost::iostreams::stream_offset>(sizeof(uint64_t) * rowCount);
   boost::iostreams::mapped_file_sink fileSink(params); // NOTE: using this form of the constructor will take care of creating and sizing the file.
   uint64_t* dest = reinterpret_cast<uint64_t*>(fileSink.data());
   DoMapping(dest, rowCount);
}

void DoMapping(uint64_t* dest, size_t rowCount)
{
   inputStream->seekg(0, std::ios::beg);
   uint32_t index, value;
   for (size_t i = 0; i<rowCount; ++i)
   {
      inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
      inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
      dest[index] = value;
   }
}
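For comparison, the native mmap() variant uses the POSIX calls directly. The following is a minimal sketch of that setup; the function name and error handling are illustrative, not the posted test code:

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Create the output file, size it with ftruncate(), and map it
// read/write and shared so stores land in the page cache.
uint64_t* MapOutputFile(const char* path, size_t rowCount, int& fd)
{
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return nullptr; }
    const size_t bytes = sizeof(uint64_t) * rowCount;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0)
    { perror("ftruncate"); close(fd); return nullptr; }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return nullptr; }
    return static_cast<uint64_t*>(p);
}
```

The returned pointer can be handed to DoMapping() exactly like the pointer obtained from fileSink.data().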

One final test was done in Python to reproduce this in another language. The fall-off happened at the same place, so it looks like the same problem.

# Python code using numpy
import numpy as np
fpr = np.memmap(inputFile, dtype='uint32', mode='r', shape=(count*2))
out = np.memmap(outputFile, dtype='uint64', mode='w+', shape=(count))
print("writing output")
out[fpr[::2]] = fpr[1::2]

For the C++ tests, Windows and Linux have similar performance up to around 300 million int64s (with Linux looking slightly faster). Performance appears to fall off on Linux at around 3 GB (400 million * 8 bytes per int64 = 3.2 GB) for both C++ and Python.

I know that 3 GB is a magic boundary on 32-bit Linux, but I am unaware of similar behavior on 64-bit Linux.

The gist of the results is that 1.4 minutes on Windows becomes 1.7 hours on Linux at 400 million int64s. I am actually trying to map close to 1.3 billion int64s.

Can anyone explain why there is such a disconnect in performance between Windows and Linux?

Any help or suggestions would be greatly appreciated!

LoadTest.cpp

Makefile

LoadTest.vcxproj

updated mmap_test.py

original mmap_test.py

Updated Results (with updated Python code... Python speed is now comparable with C++)

Original Results (NOTE: the Python results are stale)

Solution

Edit: Upgrading to "proper answer". The problem is with the way that "dirty pages" are handled by Linux. I still want my system to flush dirty pages now and again, so I didn't allow it to have TOO many outstanding pages. But at the same time, I can show that this is what is going on.

I did this (with "sudo -i"):

# echo 80 > /proc/sys/vm/dirty_ratio
# echo 60 > /proc/sys/vm/dirty_background_ratio

Which gives these VM dirty settings:

grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:60
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:80
/proc/sys/vm/dirty_writeback_centisecs:500

This makes my benchmark run like this:

$ ./a.out m64 200000000
Setup Duration 33.1042 seconds
Linux: mmap64
size=1525 MB
Mapping Duration 30.6785 seconds
Overall Duration 91.7038 seconds

Compare with "before":

$ ./a.out m64 200000000
Setup Duration 33.7436 seconds
Linux: mmap64
size=1525
Mapping Duration 1467.49 seconds
Overall Duration 1501.89 seconds

which had these VM dirty settings:

grep ^ /proc/sys/vm/dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

I'm not sure exactly what settings I should use to get IDEAL performance whilst still not leaving all dirty pages sitting around in memory forever (meaning that if the system crashes, it takes much longer to write out to disk).

For history: Here's what I originally wrote as a "non-answer" - some comments here still apply...

Not REALLY an answer, but I find it rather interesting that if I change the code to first read the entire array and then write it out, it's SIGNIFICANTLY faster than doing both in the same loop. I appreciate that this is utterly useless if you need to deal with really huge data sets (bigger than memory). With the original code as posted, the time for 100M uint64 values is 134s. When I split the read and the write cycle, it's 43s.

This is the DoMapping function [only code I've changed] after modification:

struct VI
{
    uint32_t value;
    uint32_t index;
};


void DoMapping(uint64_t* dest, size_t rowCount)
{
   inputStream->seekg(0, std::ios::beg);
   std::chrono::system_clock::time_point startTime = std::chrono::system_clock::now();
   uint32_t index, value;
   std::vector<VI> data;
   for(size_t i = 0; i < rowCount; i++)
   {
       inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
       inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
       VI d = {index, value};
       data.push_back(d);
   }
   for (size_t i = 0; i<rowCount; ++i)
   {
       value = data[i].value;
       index = data[i].index;
       dest[index] = value;
   }
   std::chrono::duration<double> mappingTime = std::chrono::system_clock::now() - startTime;
   std::cout << "Mapping Duration " << mappingTime.count() << " seconds" << std::endl;
   inputStream.reset();
}

I'm currently running a test with 200M records, which on my machine takes a significant amount of time (2000+ seconds without code-changes). It is very clear that the time taken is from disk-I/O, and I'm seeing IO-rates of 50-70MB/s, which is pretty good, as I don't really expect my rather unsophisticated setup to deliver much more than that. The improvement is not as good with the larger size, but still a decent improvement: 1502s total time, vs 2021s for the "read and write in the same loop".

Also, I'd like to point out that this is a rather terrible test for any system - the fact that Linux is notably worse than Windows is beside the point - you do NOT really want to map a large file and write 8 bytes [meaning the 4KB page has to be read in] to each page at random. If this reflects your REAL application, then you seriously should rethink your approach in some way. It will run fine when you have enough free memory that the whole memory-mapped region fits in RAM.
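If the data fits in RAM anyway, as in the split-loop version above, one way to soften that access pattern would be to sort the buffered pairs by destination index before writing, so the stores walk the mapping sequentially and dirty each page once. A sketch of the idea only (the VI struct is restated for self-containment; this was not benchmarked above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct VI
{
    uint32_t value;
    uint32_t index;
};

// Sketch: sort buffered pairs by destination index so the writes into
// the mapping become a sequential sweep instead of random page dirtying.
void WriteSortedByIndex(uint64_t* dest, std::vector<VI>& data)
{
    std::sort(data.begin(), data.end(),
              [](const VI& a, const VI& b) { return a.index < b.index; });
    for (const VI& d : data)
        dest[d.index] = d.value;
}
```

The O(n log n) sort is cheap compared to random 4KB page faults, but it only helps when the whole (index, value) set can be buffered in memory.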

There is plenty of RAM in my system, so I believe that the problem is that Linux doesn't like too many mapped pages that are "dirty".

I have a feeling that this may have something to do with it: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages More explanation: http://www.westnet.com/~gsmith/content/linux-pdflush.htm

Unfortunately, I'm also very tired and need to sleep. I'll see if I can experiment with these tomorrow - but don't hold your breath. Like I said, this is not REALLY an answer, but rather a long comment that doesn't fit in a comment (and contains code, which is completely rubbish to read in a comment).
