Will casting around sockaddr_storage and sockaddr_in break strict aliasing?


Question


Following my previous question, I'm really curious about this code -

case AF_INET: 
    {
        struct sockaddr_in * tmp =
            reinterpret_cast<struct sockaddr_in *> (&addrStruct);
        tmp->sin_family = AF_INET;
        tmp->sin_port = htons(port);
        inet_pton(AF_INET, addr, &tmp->sin_addr);
    }
    break;

Before asking this question, I searched SO for the same topic and got mixed responses. For example, see this, this and this post, which say that this kind of code is somehow safe. Another post says to use unions for such a task, but the comments on the accepted answer beg to differ.


Microsoft's documentation on the same structure says:

Application developers normally use only the ss_family member of the SOCKADDR_STORAGE. The remaining members ensure that the SOCKADDR_STORAGE can contain either an IPv6 or IPv4 address and the structure is padded appropriately to achieve 64-bit alignment. Such alignment enables protocol-specific socket address data structures to access fields within a SOCKADDR_STORAGE structure without alignment problems. With its padding, the SOCKADDR_STORAGE structure is 128 bytes in length.

The Open Group's documentation states:

The header shall define the sockaddr_storage structure. This structure shall be:

Large enough to accommodate all supported protocol-specific address structures

Aligned at an appropriate boundary so that pointers to it can be cast as pointers to protocol-specific address structures and used to access the fields of those structures without alignment problems

The socket man page says the same:

In addition, the sockets API provides the data type struct sockaddr_storage. This type is suitable to accommodate all supported domain-specific socket address structures; it is large enough and is aligned properly. (In particular, it is large enough to hold IPv6 socket addresses.)


I've seen multiple implementations using such casts in both C and C++ in the wild, and now I'm unsure which approach is right, since some posts contradict the above claims: this and this.

So which is the safe and correct way to fill in a sockaddr_storage structure? Are these pointer casts safe, or should I use the union method? I'm also aware of the getaddrinfo() call, but it seems a little complicated for the task of just filling in the structs. Another recommended way uses memcpy; is that safe?

Answer

C and C++ compilers have become much more sophisticated in the past decade than they were when the sockaddr interfaces were designed, or even when C99 was written. As part of that, the understood purpose of "undefined behavior" has changed. Back in the day, undefined behavior was usually intended to cover disagreement among hardware implementations as to what the semantics of an operation was. But nowadays, thanks ultimately to a number of organizations who wanted to stop having to write FORTRAN and could afford to pay compiler engineers to make that happen, undefined behavior is a thing that compilers use to make inferences about the code. Left shift is a good example: C99 6.5.7p3,4 (rearranged a little for clarity) reads

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If the value of [E2] is negative or is greater than or equal to the width of the promoted [E1], the behavior is undefined.

So, for instance, 1u << 33 is UB on a platform where unsigned int is 32 bits wide. The committee made this undefined because different CPU architectures' left-shift instructions do different things in this case: some produce zero consistently, some reduce the shift count modulo the width of the type (x86), some reduce the shift count modulo some larger number (ARM), and at least one historically-common architecture would trap (I don't know which one, but that's why it's undefined and not unspecified). But nowadays, if you write

unsigned int left_shift(unsigned int x, unsigned int y)
{ return x << y; }

on a platform with 32-bit unsigned int, the compiler, knowing the above UB rule, will infer that y must have a value in the range 0 through 31 when the function is called. It will feed that range into interprocedural analysis, and use it to do things like remove unnecessary range checks in the callers. If the programmer has reason to think they aren't unnecessary, well, now you begin to see why this topic is such a can of worms.
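If out-of-range shift counts can actually reach such a function, one option is to make the check part of the shift itself, so no caller depends on undefined behavior. The following is a minimal sketch; the name checked_left_shift and the choice to return 0 for oversized counts are assumptions made for illustration, not something the standard prescribes.

```c
#include <assert.h>
#include <limits.h>

/* Width of unsigned int in bits. Assumes the type has no padding bits,
   which holds on mainstream platforms. */
#define UINT_WIDTH_BITS ((unsigned int)(sizeof(unsigned int) * CHAR_BIT))

/* Hypothetical helper: a left shift that is defined for every y.
   Counts at or beyond the width yield 0 instead of undefined behavior. */
unsigned int checked_left_shift(unsigned int x, unsigned int y)
{
    if (y >= UINT_WIDTH_BITS)
        return 0;
    return x << y;
}

/* Usage: in-range counts behave like <<, oversized counts give 0. */
void shift_examples(void)
{
    assert(checked_left_shift(1u, 3u) == 8u);
    assert(checked_left_shift(1u, UINT_WIDTH_BITS) == 0u);
    assert(checked_left_shift(1u, UINT_WIDTH_BITS + 1u) == 0u);
}
```

Because the range check sits in front of the shift, the compiler has nothing to infer from the UB rule that the source does not already state.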

For more on this change in the purpose of undefined behavior, please see the LLVM people's three-part essay on the subject (1 2 3).


Now that you understand that, I can actually answer your question.

These are the definitions of struct sockaddr, struct sockaddr_in, and struct sockaddr_storage, after eliding some irrelevant complications:

struct sockaddr {
    uint16_t sa_family;
};
struct sockaddr_in { 
    uint16_t sin_family;
    uint16_t sin_port;
    uint32_t sin_addr;
};
struct sockaddr_storage {
    uint16_t ss_family;
    char __ss_storage[128 - (sizeof(uint16_t) + sizeof(unsigned long))];
    unsigned long int __ss_force_alignment;
};

This is poor man's subclassing. It is a ubiquitous idiom in C. You define a set of structures that all have the same initial field, which is a code number that tells you which structure you've actually been passed. Back in the day, everyone expected that if you allocated and filled in a struct sockaddr_in, upcast it to struct sockaddr, and passed it to e.g. connect, the implementation of connect could dereference the struct sockaddr pointer safely to retrieve the sa_family field, learn that it was looking at a sockaddr_in, cast it back, and proceed. The C standard has always said that dereferencing the struct sockaddr pointer triggers undefined behavior—those rules are unchanged since C89—but everyone expected that it would be safe in this case because it would be the same "load 16 bits" instruction no matter which structure you were really working with. That's why POSIX and the Windows documentation talk about alignment; the people who wrote those specs, back in the 1990s, thought that the primary way this could actually be trouble was if you wound up issuing a misaligned memory access.

But the text of the standard doesn't say anything about load instructions, nor alignment. This is what it says (C99 §6.5p7 + footnote):

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:73)

  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • a type that is the signed or unsigned type corresponding to the effective type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

73) The intent of this list is to specify those circumstances in which an object may or may not be aliased.

struct types are "compatible" only with themselves, and the "effective type" of a declared variable is its declared type. So the code you showed...

struct sockaddr_storage addrStruct;
/* ... */
case AF_INET: 
{
    struct sockaddr_in * tmp = (struct sockaddr_in *)&addrStruct;
    tmp->sin_family = AF_INET;
    tmp->sin_port = htons(port);
    inet_pton(AF_INET, addr, &tmp->sin_addr);
}
break;

... has undefined behavior, and compilers can make inferences from that, even though naive code generation would behave as expected. What a modern compiler is likely to infer from this is that the case AF_INET can never be executed. It will delete the entire block as dead code, and hilarity will ensue.


So how do you work with sockaddr safely? The shortest answer is "just use getaddrinfo and getnameinfo." They deal with this problem for you.
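For the task in the question (turning a numeric address string and a port into a filled-in structure), a getaddrinfo call can be kept fairly small. This is a sketch under assumptions: fill_sockaddr is a hypothetical helper name, the socket type is fixed to SOCK_STREAM, and AI_NUMERICHOST and AI_NUMERICSERV are used so that no name or service lookups happen.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: parse a numeric IPv4 or IPv6 address and a numeric
   port into `out`, with no casts between sockaddr types in our own code.
   Returns 0 on success, or a nonzero getaddrinfo error code. */
int fill_sockaddr(const char *addr, const char *port,
                  struct sockaddr_storage *out, socklen_t *outlen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV; /* no lookups */

    int rc = getaddrinfo(addr, port, &hints, &res);
    if (rc != 0)
        return rc;

    /* sockaddr_storage is specified to be large enough for any address. */
    assert(res->ai_addrlen <= sizeof *out);
    memset(out, 0, sizeof *out);
    memcpy(out, res->ai_addr, res->ai_addrlen); /* byte copy */
    *outlen = res->ai_addrlen;

    freeaddrinfo(res);
    return 0;
}
```

The result can be passed to connect or bind as (struct sockaddr *)out together with *outlen. The caller never needs to read the protocol-specific fields.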

But maybe you need to work with an address family, such as AF_UNIX, that getaddrinfo doesn't handle. In most cases you can just declare a variable of the correct type for the address family, and cast it only when calling functions that take a struct sockaddr *:

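#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int connect_to_unix_socket(const char *path, int type)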
{
    struct sockaddr_un sun;
    size_t plen = strlen(path);
    if (plen >= sizeof(sun.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    sun.sun_family = AF_UNIX;
    memcpy(sun.sun_path, path, plen+1);

    int sock = socket(AF_UNIX, type, 0);
    if (sock == -1) return -1;

    if (connect(sock, (struct sockaddr *)&sun,
                offsetof(struct sockaddr_un, sun_path) + plen)) {
        int save_errno = errno;
        close(sock);
        errno = save_errno;
        return -1;
    }
    return sock;
}

The implementation of connect has to jump through some hoops to make this safe, but that is Not Your Problem.
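The question also asks about the memcpy approach. The same "declare the correct type" idea covers it: build the address in a real struct sockaddr_in, then copy its bytes into the sockaddr_storage. memcpy accesses both objects as bytes, which the character-type rule quoted above permits. This is a sketch assuming IPv4 only; fill_ipv4 is a hypothetical name.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill `out` with an IPv4 address and port.
   Returns 0 on success, -1 if `addr` is not a valid dotted-quad string. */
int fill_ipv4(struct sockaddr_storage *out, const char *addr,
              unsigned short port)
{
    struct sockaddr_in in4; /* the object really has this type */

    memset(&in4, 0, sizeof in4);
    in4.sin_family = AF_INET;
    in4.sin_port = htons(port);
    if (inet_pton(AF_INET, addr, &in4.sin_addr) != 1)
        return -1;

    /* sockaddr_storage is specified to be large enough for this. */
    assert(sizeof in4 <= sizeof *out);
    memset(out, 0, sizeof *out);
    memcpy(out, &in4, sizeof in4); /* byte copy, no struct-typed alias */
    return 0;
}
```

Reading the address back works the same way in reverse: memcpy from the storage into a local struct sockaddr_in and inspect that.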

Contra the other answer, there is one case where you might want to use sockaddr_storage: in conjunction with getpeername and getnameinfo, in a server that needs to handle both IPv4 and IPv6 addresses. It is a convenient way to know how big a buffer to allocate.

#ifndef NI_IDN
#define NI_IDN 0
#endif
char *get_peer_hostname(int sock)
{
    char addrbuf[sizeof(struct sockaddr_storage)];
    socklen_t addrlen = sizeof addrbuf;

    if (getpeername(sock, (struct sockaddr *)addrbuf, &addrlen))
        return 0;

    char *peer_hostname = malloc(MAX_HOSTNAME_LEN+1);
    if (!peer_hostname) return 0;

    if (getnameinfo((struct sockaddr *)addrbuf, addrlen,
                    peer_hostname, MAX_HOSTNAME_LEN+1,
                    0, 0, NI_IDN)) {
        free(peer_hostname);
        return 0;
    }
    return peer_hostname;
}

(I could just as well have written struct sockaddr_storage addrbuf, but I wanted to emphasize that I never actually need to access the contents of addrbuf directly.)

A final note: if the BSD folks had defined the sockaddr structures just a little bit differently ...

struct sockaddr {
    uint16_t sa_family;
};
struct sockaddr_in { 
    struct sockaddr sin_base;
    uint16_t sin_port;
    uint32_t sin_addr;
};
struct sockaddr_storage {
    struct sockaddr ss_base;
    char __ss_storage[128 - (sizeof(uint16_t) + sizeof(unsigned long))];
    unsigned long int __ss_force_alignment;
};

... upcasts and downcasts would have been perfectly well-defined, thanks to the "aggregate or union that includes one of the aforementioned types" rule. If you're wondering how you should deal with this problem in new C code, here you go.
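Applied to new code, the pattern might look like the sketch below. The msg_* types and msg_value are hypothetical names invented for illustration. Each concrete struct embeds the common base as its first member, so the upcast is just taking the address of that member, and the downcast relies on the guarantee that a pointer to a struct and a pointer to its first member convert to each other.

```c
#include <assert.h>
#include <stdint.h>

enum msg_kind { MSG_PING = 1, MSG_DATA = 2 };

/* Common base: every message starts with one of these. */
struct msg_base {
    uint16_t kind;
};

struct msg_ping {
    struct msg_base base; /* must be the first member */
    uint32_t seq;
};

struct msg_data {
    struct msg_base base; /* must be the first member */
    uint32_t len;
};

/* Reads the tag through the base type, then downcasts. The downcast is
   only valid when the object really is the struct named by the tag. */
uint32_t msg_value(const struct msg_base *m)
{
    switch (m->kind) {
    case MSG_PING:
        return ((const struct msg_ping *)m)->seq;
    case MSG_DATA:
        return ((const struct msg_data *)m)->len;
    default:
        return 0;
    }
}

/* Usage: the upcast is &obj.base, with no cast needed. */
void msg_examples(void)
{
    struct msg_ping ping = { { MSG_PING }, 42 };
    struct msg_data data = { { MSG_DATA }, 7 };

    assert(msg_value(&ping.base) == 42);
    assert(msg_value(&data.base) == 7);
}
```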
