What goes on behind the curtains during disk I/O?


Question

When I seek to some position in a file and write a small amount of data (20 bytes), what goes on behind the scenes?

My understanding

To my knowledge, the smallest unit of data that can be written or read from a disk is one sector (traditionally 512 bytes, but that standard is now changing). That means to write 20 bytes I need to read a whole sector, modify some of it in memory and write it back to disk.

This is what I expect to be happening in unbuffered I/O. I also expect buffered I/O to do roughly the same thing, but be clever about its cache. So I would have thought that if I blow locality out the window by doing random seeks and writes, both buffered and unbuffered I/O ought to have similar performance... maybe with unbuffered coming out slightly better.

Then again, I know it's crazy for buffered I/O to only buffer one sector, so I might also expect it to perform terribly.

My application

I am storing values gathered by a SCADA device driver that receives remote telemetry for upwards of a hundred thousand points. There is extra data in the file such that each record is 40 bytes, but only 20 bytes of that needs to be written during an update.

Pre-implementation benchmark

To check that I don't need to dream up some brilliantly over-engineered solution, I have run a test using a few million random records written to a file that could contain a total of 200,000 records. Each test seeds the random number generator with the same value to be fair. First I erase the file and pad it to the total length (about 7.6 meg), then loop a few million times, passing a random file offset and some data to one of two test functions:

void WriteOldSchool( void *context, long offset, Data *data )
{
    int fd = (int)(intptr_t)context;  /* fd round-tripped through the void* context */
    lseek( fd, offset, SEEK_SET );
    write( fd, (void*)data, sizeof(Data) );
}

void WriteStandard( void *context, long offset, Data *data )
{
    FILE *fp = (FILE*)context;
    fseek( fp, offset, SEEK_SET );
    fwrite( (void*)data, sizeof(Data), 1, fp );
    fflush(fp);
}

Perhaps no surprises?

The OldSchool method came out on top - by a lot. It was over 6 times faster (1.48 million versus 232,000 records per second). To make sure I hadn't run into hardware caching, I expanded my database size to 20 million records (file size of 763 meg) and got the same results.

Before you point out the obvious call to fflush, let me say that removing it had no effect. I imagine this is because the cache must be committed when I seek sufficiently far away, which is what I'm doing most of the time.

So, what's happening?

It seems to me that the buffered I/O must be reading (and possibly writing all of) a large chunk of the file whenever I try to write. Because I am hardly ever taking advantage of its cache, this is extremely wasteful.

In addition (and I don't know the details of hardware caching on disk), if the buffered I/O is trying to write a bunch of sectors when I change only one, that would reduce the effectiveness of the hardware cache.

Are there any disk experts out there who can comment and explain this better than my experimental findings? =)

Answer

Indeed, at least on my system with GNU libc, it looks like stdio is reading 4kB blocks before writing back the changed portion. Seems bogus to me, but I imagine somebody thought it was a good idea at the time.

I checked by writing a trivial C program to open a file, write a small amount of data once, and exit; then ran it under strace to see which syscalls it actually triggered. Writing at an offset of 10000, I saw these syscalls:

lseek(3, 8192, SEEK_SET)                = 8192
read(3, ""..., 1808) = 1808
write(3, "hello", 5)                    = 5

Seems that you'll want to stick with the low-level Unix-style I/O for this project, eh?
