Bypassing 4KB block size limitation on block layer/device

Problem description

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes, >4KB at a time (even MBs in size). My understanding is that Linux and its filesystems will "chop" files into 4KB blocks that are passed to the block device driver, which then needs to physically transfer that block to or from the device (e.g., for a write).
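
For context, the block sizes the kernel reports for a device can be checked from user space. A minimal sketch using the standard BLKSSZGET/BLKPBSZGET ioctls ("/dev/sda" is just a placeholder device name):

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET */

int main(int argc, char **argv)
{
    /* "/dev/sda" is only a placeholder; pass the device you care about. */
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;
    if (ioctl(fd, BLKSSZGET, &logical) < 0)    /* logical sector size  */
        perror("BLKSSZGET");
    if (ioctl(fd, BLKPBSZGET, &physical) < 0)  /* physical sector size */
        perror("BLKPBSZGET");

    printf("%s: logical sector = %d bytes, physical sector = %u bytes\n",
           dev, logical, physical);
    close(fd);
    return 0;
}
```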

I am also aware that the kernel page size plays a role in this limitation, as it is set to 4KB.
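
The page size itself is easy to confirm; a trivial check using POSIX sysconf (nothing device-specific assumed):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The page cache works in units of this size (4096 on most systems). */
    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
    return 0;
}
```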

As an experiment, I want to find out whether there is a way to actually increase this block size, so that we can save some time (instead of doing multiple 4KB writes, we could do it with a bigger block size).
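
For what it's worth, user space can already hand the kernel one large request and let the block layer split it according to the device's queue limits. A rough sketch of a single 1 MiB O_DIRECT write (the file name and sizes are arbitrary):

```c
#define _GNU_SOURCE        /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define IO_SIZE   (1024 * 1024)  /* one 1 MiB request instead of 256 x 4 KiB */
#define ALIGNMENT 4096           /* O_DIRECT needs aligned buffers/offsets   */

int main(void)
{
    /* "testfile" is only a placeholder path for this sketch. */
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, ALIGNMENT, IO_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xab, IO_SIZE);

    /* One system call; the block layer may still split this into several
     * requests depending on the device's queue limits. */
    ssize_t n = pwrite(fd, buf, IO_SIZE, 0);
    if (n < 0)
        perror("pwrite");
    else
        printf("wrote %zd bytes in a single request\n", n);

    free(buf);
    close(fd);
    return 0;
}
```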

Is there any FS or existing project that I can take a look at for this? If not, what is needed to do this experiment - which parts of Linux need to be modified? I am trying to find out the level of difficulty and the resources needed. Or, whether it is simply not possible, and/or any reason why we do not even need to do so. Any comment is appreciated.

Thanks.

Answer

The 4k limitation is due to the page cache. The main issue is what happens if you have a 4k page size but a 32k block size, and a file that is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000 and writes a single byte. Now suppose the system is under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
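
To make that scenario concrete, here is a small user-space sketch of exactly that access pattern (the file name is arbitrary; the sizes mirror the example above):

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* "sparse.dat" is an arbitrary name; sizes mirror the example above. */
    int fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char head[2000];
    memset(head, 'a', sizeof(head));
    if (pwrite(fd, head, sizeof(head), 0) < 0)   /* file is now 2000 bytes long */
        perror("pwrite head");

    char byte = 'x';
    if (pwrite(fd, &byte, 1, 20000) < 0)         /* dirty one byte at offset 20000 */
        perror("pwrite byte");

    /* With 4k pages, only the pages covering offsets 0-4095 and 16384-20479
     * of a hypothetical 32k block ever get instantiated in the page cache. */
    close(fd);
    return 0;
}
```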

Also, let's assume that the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block so that we can do the read-modify-write cycle just to update that one dirty 4k page at offset 20000?

These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages need to be instantiated in chunks of 8 pages at a time, and if there is memory pressure to push out a particular page, you need to write out all 8 pages at the same time if the block is dirty, and then drop all 8 pages from the page cache at the same time. All of this implies that you want to track page usage and dirtiness not at the 4k page level, but at the compound 32k page/"block" level. It basically will involve changes to almost every single part of the VM subsystem, from the page cleaner, to the page fault handler, the page scanner, the writeback algorithms, etc., etc., etc.
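
Purely as an illustration of the extra bookkeeping this would require (none of these names are kernel APIs; this is a hypothetical user-space sketch), tracking validity and dirtiness per 4k sub-page of a 32k block might look like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096
#define BLOCK_SIZE (32 * 1024)
#define SUBPAGES   (BLOCK_SIZE / PAGE_SIZE)   /* 8 x 4k pages per 32k block */

/* Hypothetical per-block state: which 4k sub-pages hold valid data in memory
 * and which are dirty. Nothing like this exists in the kernel; it only
 * sketches the bookkeeping described above. */
struct big_block {
    uint8_t valid_mask;   /* bit i set => sub-page i is populated in memory */
    uint8_t dirty_mask;   /* bit i set => sub-page i has unwritten changes  */
};

static void mark_subpage_dirty(struct big_block *b, unsigned offset_in_block)
{
    unsigned idx = offset_in_block / PAGE_SIZE;
    b->valid_mask |= (uint8_t)(1u << idx);
    b->dirty_mask |= (uint8_t)(1u << idx);
}

/* Writing back any dirty sub-page forces the whole 32k block out together,
 * which may first require a read-modify-write to populate the missing pages. */
static bool needs_rmw_before_writeback(const struct big_block *b)
{
    return b->dirty_mask != 0 &&
           b->valid_mask != (uint8_t)((1u << SUBPAGES) - 1);
}

int main(void)
{
    struct big_block b = { 0, 0 };
    mark_subpage_dirty(&b, 20000 % BLOCK_SIZE);   /* the write at offset 20000 */
    printf("needs read-modify-write before writeback: %s\n",
           needs_rmw_before_writeback(&b) ? "yes" : "no");
    return 0;
}
```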

Also consider that even if you did hire a Linux VM expert to do this work (which the HDD vendors would deeply love you for, since they also want to be able to deploy HDDs with a 32k or 64k physical sector size), it would be 5-7 years before such a modified VM layer would make its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup that is hoping to sell your SSD product into the enterprise market --- you might as well give up on this approach now. It's just not going to work before you run out of money.

Now, if you happen to be working for a large cloud company that makes its own hardware (à la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since such companies don't use enterprise kernels, which add new features at a glacial pace --- but even so, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.

If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies who are in this same space, and maybe you could collaborate with them to see if together you could do this kind of development work and try to get this kind of change upstream. It really, really is not a trivial change, though --- especially since the upstream Linux kernel developers will demand that this not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change that you would want to maintain as a private change to your kernel, but something that you would want to get upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.
