Why does Linux use getdents() on directories instead of read()?


Question

I was skimming through K&R C and I noticed that to read the entries in a directory, they used:

while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf)) == sizeof(dirbuf))
    /* code */

Where dirbuf was a system-specific directory structure, and dp->fd a valid file descriptor. On my system, dirbuf would have been a struct linux_dirent. Note that a struct linux_dirent has a flexible array member for the entry name, but let us assume, for the sake of simplicity, that it doesn't. (Dealing with the flexible array member in this scenario would only require a little extra boilerplate code).

Linux, however, doesn't support this construct. When using read() to try reading directory entries as above, read() returns -1 and errno is set to EISDIR.
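The EISDIR behaviour described above can be checked with a few lines of C. This is a minimal sketch assuming a Linux system; read_errno is a hypothetical helper name, not something from K&R or the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` read-only, attempt one read(2), and
 * return 0 if the read succeeded or the errno value if a call failed. */
int read_errno(const char *path)
{
    char buf[256];
    int saved;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);      /* opening a directory read-only is allowed */
    if (fd == -1)
        return errno;

    /* On Linux, this read(2) is expected to fail with EISDIR for a directory. */
    saved = (read(fd, buf, sizeof buf) == -1) ? errno : 0;
    close(fd);
    return saved;
}
```

Calling read_errno(".") on Linux should give EISDIR, while a regular readable file should give 0. Other Unix systems may behave differently.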

Instead, Linux dedicates a system call specifically for reading directories, namely the getdents() system call. However, I've noticed that it works in pretty much the same way as above.

while (syscall(SYS_getdents, fd, &dirbuf, sizeof(dirbuf)) != -1)
    /* code */

What was the rationale behind this? There seems to be little/no benefit compared to using read() as done in K&R.

Answer

getdents will return struct linux_dirent. It will do this for any underlying type of filesystem. The "on disk" format could be completely different, known only to the given filesystem driver, so a simple userspace read call could not work. That is, getdents may convert from the native format to fill the linux_dirent.
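As a rough illustration of what the kernel hands back, here is a sketch that walks the records returned by the raw system call. It assumes Linux and swaps in the getdents64 variant (the question uses SYS_getdents), because its record keeps d_type in a fixed position. The struct is declared by hand following the layout in the getdents(2) man page, and count_dir_entries is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout for getdents64, declared by hand (see getdents(2)). */
struct linux_dirent64 {
    uint64_t       d_ino;    /* inode number */
    int64_t        d_off;    /* opaque offset of the next entry */
    unsigned short d_reclen; /* total length of this record in bytes */
    unsigned char  d_type;   /* file type, may be DT_UNKNOWN */
    char           d_name[]; /* NUL-terminated entry name */
};

/* Hypothetical helper: count every record in a directory (including "."
 * and ".." where the filesystem reports them). Returns -1 on error. */
long count_dir_entries(const char *path)
{
    _Alignas(struct linux_dirent64) char buf[4096];
    long total = 0;
    long nread;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    /* A positive return value is the number of bytes filled in,
     * 0 means end of directory, and -1 means error. */
    while ((nread = syscall(SYS_getdents64, fd, buf, sizeof buf)) > 0) {
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            total++;
            pos += d->d_reclen;     /* records are variable length */
        }
    }
    close(fd);
    return nread == -1 ? -1 : total;
}
```

Note that the loop stops on a return value of 0 (end of directory) as well as on -1. A loop that only tests for -1 would never terminate.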

> Couldn't the same thing be said about reading bytes from a file with read()? The on disk format of the data within a file isn't necessarily uniform across filesystems or even contiguous on disk - thus, reading a series of bytes from disk would again be something I expect to be delegated to the file system driver.

The discontiguous file data is handled by the VFS ["virtual filesystem"] layer, regardless of how an FS chooses to organize the block list for a file (e.g. ext4 uses "inodes": "index" or "information" nodes. These use an "ISAM" ("index sequential access method") organization. But an MS/DOS FS can have a completely different organization).

Each FS driver registers a table of VFS function callbacks when it's started. For a given operation (e.g. open/close/read/write/seek), there is a corresponding entry in the table.

The VFS layer (i.e. from the userspace syscall) will "call down" into the FS driver and the FS driver will perform the operation, doing whatever it deems necessary to fulfill the request.
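The callback-table idea can be modelled in userspace. The following is a toy sketch only: every name in it (toy_file_ops, toy_sys_read, and so on) is hypothetical, and it is not the kernel's actual data structure. It only shows how a generic layer can dispatch through a per-file table and reject read on something that registers a directory callback instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Toy stand-in for a per-filesystem callback table. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
    long (*iterate_dir)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops;
    const char *data;   /* file contents, or a name list for a directory */
    size_t pos;         /* current offset into data */
};

/* Shared toy "driver" routine: copy out up to len bytes from f->data. */
static long toy_copy(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* Regular files register only read; directories only iterate_dir. */
const struct toy_file_ops toy_regular_ops = { .read = toy_copy };
const struct toy_file_ops toy_dir_ops = { .iterate_dir = toy_copy };

/* Toy "system call" layer: look up the callback and call down into it. */
long toy_sys_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL);
    if (f->ops->read == NULL)
        return f->ops->iterate_dir ? -EISDIR : -EINVAL;
    return f->ops->read(f, buf, len);
}

long toy_sys_getdents(struct toy_file *f, char *buf, size_t len)
{
    assert(f != NULL && f->ops != NULL);
    if (f->ops->iterate_dir == NULL)
        return -ENOTDIR;
    return f->ops->iterate_dir(f, buf, len);
}
```

With a struct toy_file whose ops field points at toy_dir_ops, toy_sys_read returns -EISDIR and toy_sys_getdents returns data. With toy_regular_ops it is the other way round (-ENOTDIR from toy_sys_getdents).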

> I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented.

Yes. For example, if the read request is to read the first three blocks from the file (e.g. 0,1,2), the FS will look up the indexing information for the file and get a list of physical blocks to read (e.g. 1000000,200,37) from the disk surface. This is all handled transparently in the FS driver.

The userspace program will simply see its buffer filled up with the correct data, without regard to how complex the FS indexing and block fetch had to be.
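The logical-to-physical mapping in the example above (logical blocks 0,1,2 stored in physical blocks 1000000,200,37) can be sketched with a toy in-memory "disk". Everything here is hypothetical: the names, the 4-byte block size, and the block contents are made up for illustration, and no real filesystem works this simply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLOCK_SIZE 4   /* unrealistically small, to keep the demo readable */

/* Toy "disk": a few physical blocks, identified by physical block number. */
struct toy_block {
    unsigned long phys;
    char data[TOY_BLOCK_SIZE + 1];
};

static const struct toy_block toy_disk[] = {
    { 37,      "ijkl" },
    { 200,     "efgh" },
    { 1000000, "abcd" },
};

/* Toy per-file index: logical block i lives in physical block toy_index[i]. */
static const unsigned long toy_index[] = { 1000000, 200, 37 };
#define TOY_NBLOCKS (sizeof toy_index / sizeof toy_index[0])

/* Find a physical block on the toy disk, or NULL if it does not exist. */
static const char *toy_fetch(unsigned long phys)
{
    for (size_t i = 0; i < sizeof toy_disk / sizeof toy_disk[0]; i++)
        if (toy_disk[i].phys == phys)
            return toy_disk[i].data;
    return NULL;
}

/* Read up to len bytes starting at byte offset off. The caller sees one
 * flat byte stream and never learns which physical blocks were touched. */
long toy_file_read(size_t off, char *buf, size_t len)
{
    const size_t size = TOY_NBLOCKS * TOY_BLOCK_SIZE;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len && off < size) {
        size_t in_blk = off % TOY_BLOCK_SIZE;
        size_t n = TOY_BLOCK_SIZE - in_blk;
        const char *blk = toy_fetch(toy_index[off / TOY_BLOCK_SIZE]);

        if (blk == NULL)
            return -1;              /* index points at a missing block */
        if (n > len - done)
            n = len - done;
        memcpy(buf + done, blk + in_blk, n);
        done += n;
        off += n;
    }
    return (long)done;
}
```

Reading from offset 0 yields the bytes in logical order even though the physical block numbers are scattered and stored in a different order on the toy disk.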

Perhaps it is [loosely] more proper to refer to this as transferring inode data as there are inodes for files (i.e. an inode has the indexing information to "scatter/gather" the FS blocks for the file). But, the FS driver also uses this internally to read from a directory. That is, each directory has an inode to keep track of the indexing information for that directory.

So, to an FS driver, a directory is much like a flat file that has specially formatted information. These are the directory "entries". This is what getdents returns. This "sits on top of" the inode indexing layer.

Directory entries can be variable length [based on the length of the filename]. So, the on disk format would be (call it "Type A"):

static part|variable length name
static part|variable length name
...

But ... some FSes organize themselves differently (call it "Type B"):

<static1>,<static2>...
<variable1>,<variable2>,...

So, the type A organization might be read atomically by a userspace read(2) call, but the type B organization would have difficulty. So, the getdents VFS call handles this.

> Couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file?

That is what getdents is for.

> Then again, I'm assuming that a FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory rather than a series of bytes.

getdents did not always exist. When dirents were fixed size and there was only one FS format, the readdir(3) call probably did read(2) underneath and got a series of bytes [which is all that read(2) provides]. Actually, IIRC, in the beginning there was only readdir(2); getdents and readdir(3) did not exist.

But, what do you do if the read(2) is "short" (e.g. two bytes too small)? How do you communicate that to the app?
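To make the short-read problem concrete, here is a toy sketch using a made-up "Type A" record format: one length byte (standing in for the static part) followed by that many name bytes. The format and the name toy_complete_records are assumptions for illustration only. The function reports how many whole records fit in the bytes received and how many trailing bytes belong to a record that was cut off.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "Type A" record: [1-byte name length][name bytes].
 * Returns the number of complete records in buf[0..len) and stores the
 * count of trailing bytes that belong to a truncated record. */
size_t toy_complete_records(const unsigned char *buf, size_t len,
                            size_t *leftover)
{
    size_t pos = 0;
    size_t count = 0;

    assert(buf != NULL && leftover != NULL);
    while (pos < len) {
        size_t reclen = 1 + (size_t)buf[pos];

        if (reclen > len - pos)
            break;                  /* record cut off by the short read */
        pos += reclen;
        count++;
    }
    *leftover = len - pos;
    return count;
}
```

With a plain byte-stream interface the application would have to do this bookkeeping itself and re-read the tail. A record-oriented call can simply avoid returning partial records in the first place.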

> My question is more like: since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading the linux_dirent?

read on a dir isn't intercepted and converted to getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.

You do open(2) for files or dirs [opendir(3) is a wrapper and does open(2) underneath]. You can read/write/seek for files and seek/getdents for dirs.

But ... doing read for dirs returns EISDIR. [Side note: I had forgotten this in my original comments]. In the simple "flat data" model it provides, there isn't a way to convey/control all that getdents can/does.

So, rather than allow an inferior way to get partial/wrong info, it's simpler for the kernel and an app developer to go through the getdents interface.

Further, getdents does things atomically. If you're reading directory entries in a given program, there may be other programs that are creating and deleting files in that directory or renaming them--right in the middle of your getdents sequence.

getdents will present an atomic view. Either a file exists or it doesn't. It's been renamed or it hasn't. So, you don't get a "half modified" view, regardless of how much "turmoil" is happening around you. When you ask getdents for 20 entries, you'll get them [or 10 if there are only that many].

Side note: A useful trick is to "overspecify" the count. That is, tell getdents you want 50,000 entries [you must provide the space]. You'll usually get back something like 100 or so. But, now, what you've got is an atomic snapshot in time for the full directory. I sometimes do this instead of looping with a count of 1--YMMV. You still have to protect against immediate disappearance but at least you can see it (i.e. a subsequent file open fails).
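A sketch of the oversized-request trick from the side note above, assuming Linux and the getdents64 variant. As the getdents(2) man page describes it, the third argument is the size of the buffer in bytes, so "asking for many entries" is approximated here by supplying a large buffer. snapshot_dir is a hypothetical name. Whether one call really amounts to a consistent snapshot is the answer's claim, and this code only shows the mechanics.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout for getdents64, declared by hand (see getdents(2)). */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: issue ONE getdents64 call with a bufsize-byte
 * buffer and return how many records it produced. *complete is set to 1
 * if a second call reports end of directory (so the first call saw
 * everything), otherwise 0. Returns -1 on error. */
long snapshot_dir(const char *path, size_t bufsize, int *complete)
{
    long count = 0;
    long nread;
    char *buf;
    int fd;

    assert(path != NULL && complete != NULL && bufsize > 0);
    *complete = 0;
    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;
    buf = malloc(bufsize);          /* malloc memory is suitably aligned */
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    nread = syscall(SYS_getdents64, fd, buf, bufsize);
    for (long pos = 0; pos < nread; ) {
        struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
        count++;
        pos += d->d_reclen;
    }
    /* Probe once more: a return of 0 means the first call got it all. */
    if (nread >= 0 && syscall(SYS_getdents64, fd, buf, bufsize) == 0)
        *complete = 1;

    free(buf);
    close(fd);
    return nread < 0 ? -1 : count;
}
```

If *complete comes back 0, the directory did not fit in one call (or the filesystem chose to return it in pieces), and the caller would need to keep looping until a call returns 0.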

So, you always get "whole" entries and no entry for a just deleted file. That is not to say that the file is still there, merely that it was there at the time of the getdents. Another process may instantly erase it, but not in the middle of the getdents.

If read(2) were allowed, you'd have to guess at how much data to read and wouldn't know which entries were fully formed or in a partial state. If the FS had the type B organization above, a single read could not atomically get the static portion and variable portion in a single step.

It would be philosophically incorrect to slow down read(2) to do what getdents does.

getdents, unlink, creat, rmdir, and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about each other".

If pgmA renames "x" to "z" and pgmB renames "y" to "z", they don't collide. One goes first and another second but no FS blocks are ever lost/leaked. getdents gets the whole view (be it "x y", "y z", "x z" or "z"), but it will never see "x y z" simultaneously.
