Reading an int through a char* buffer behaves differently depending on whether it is positive or negative

Problem description

Background: I was wondering how to (manually) deserialize binary data received through a char* buffer.

Assumptions: As a minimal example, consider the following:

  • I have only one int, serialized into a char* buffer.
  • I want to get the original int back from the buffer.
  • sizeof(int) == 4 on the target system/platform.
  • The target system/platform is little-endian.

Note: This is purely out of general interest, so I don't want to use anything like std::memcpy, because then we would not see the strange behaviour I encountered.

Test: I have built the following test case:

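#include <bitset>
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <iostream>

// Defined after main(); see below.
void display(int num, char * num_bytes);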

int main()
{
    // Create neg_num and neg_num_bytes then display them
    int neg_num(-5000);
    char * neg_num_bytes = reinterpret_cast<char*>(&neg_num);
    display(neg_num, neg_num_bytes);

    std::cout << '\n';

    // Create pos_num and pos_num_bytes then display them
    int pos_num(5000);
    char * pos_num_bytes = reinterpret_cast<char*>(&pos_num);
    display(pos_num, pos_num_bytes);

    std::cout << '\n';

    // Get neg_num back from neg_num_bytes through bitmask operations
    int neg_num_back = 0;
    for(std::size_t i = 0; i < sizeof neg_num; ++i)
        neg_num_back |= static_cast<int>(neg_num_bytes[i]) << CHAR_BIT*i; // For little-endian

    // Get pos_num back from pos_num_bytes through bitmask operations
    int pos_num_back = 0;
    for(std::size_t i = 0; i < sizeof pos_num; ++i)
        pos_num_back |= static_cast<int>(pos_num_bytes[i]) << CHAR_BIT*i; // For little-endian

    std::cout << "Reconstructed neg_num: " << neg_num_back << ": " << std::bitset<CHAR_BIT*sizeof neg_num_back>(neg_num_back);
    std::cout << "\nReconstructed pos_num: " << pos_num_back << ":  " << std::bitset<CHAR_BIT*sizeof pos_num_back>(pos_num_back) << std::endl;

    return 0;
}

where display() is defined as:

// Warning: num_bytes must have a size of sizeof(int)
void display(int num, char * num_bytes)
{
    std::cout << num << " (from int)  : " << std::bitset<CHAR_BIT*sizeof num>(num) << '\n';
    std::cout << num << " (from char*): ";
    for(std::size_t i = 0; i < sizeof num; ++i)
        std::cout << std::bitset<CHAR_BIT>(num_bytes[sizeof num -1 -i]); // For little-endian
    std::cout << std::endl;
}

The output I get is:

-5000 (from int)  : 11111111111111111110110001111000
-5000 (from char*): 11111111111111111110110001111000

5000 (from int)  : 00000000000000000001001110001000
5000 (from char*): 00000000000000000001001110001000

Reconstructed neg_num: -5000: 11111111111111111110110001111000
Reconstructed pos_num: -120:  11111111111111111111111110001000

I know the test case code is quite hard to read. To explain it briefly:

  • I create an int.
  • I create a char* pointing to the first byte of that int (to simulate a real int stored in a char* buffer). The buffer is therefore 4 bytes long.
  • I display the int and its binary representation.
  • I display the int together with the concatenation of the bytes stored in the char* buffer, to check that they match (the bytes are printed in reverse order because of the endianness).
  • I try to get the original int back from the buffer.
  • I display the reconstructed int as well as its binary representation.

I performed this procedure for both a negative and a positive value, which is why the code is less readable than it should be (sorry for that).

As we can see, the negative value was reconstructed successfully, but the positive one was not (I expected 5000 and got -120).

I ran the test with several other negative and positive values and the conclusion was the same: it works with negative numbers but fails with positive ones.

Question: I have trouble understanding why concatenating 4 chars into an int via bitwise shifts changes the char values for positive numbers, while they stay unchanged for negative numbers.

Looking at the binary representation, we can see that the reconstructed number is not composed of the chars that I concatenated.

Is it related to the static_cast<int>? If I remove it, integral promotion will apply the conversion implicitly anyway. But I need the conversion, since the value has to be an int in order not to lose the result of the shifts.
If this is the heart of the issue, how can I solve it?

Additionally: Is there a better way to get the value back than bitwise shifting? Something that does not depend on the endianness of the system/platform.

Perhaps this should be a separate question.

Recommended answer

Two main things affect the outcome here:

  • The type char can be signed or unsigned; this is an implementation detail left to the compiler.
  • When a signed value is converted to a wider integer type, it is sign-extended.
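Which of the two cases applies on a given toolchain can be checked at compile time. The following is only a minimal sketch; the function name plain_char_is_signed is made up for this illustration and does not come from the question.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper (name chosen for this example only):
// reports whether plain char is a signed type on this implementation.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// CHAR_MIN is negative exactly when plain char is signed,
// so the two checks must always agree.
static_assert(plain_char_is_signed() == (CHAR_MIN < 0),
              "numeric_limits and CHAR_MIN disagree about char signedness");
```

If plain_char_is_signed() yields true, every byte with its high bit set turns into a negative value when it is read through a char*.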

What is probably happening here is that char is signed on your system with your compiler. That means that when you convert a byte whose high bit is set to an int, the value is sign-extended (for example, binary 10000001 is sign-extended to 11111111111111111111111110000001).

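This of course affects your bitwise operations: in the 5000 case, the low byte 10001000 is widened to an int whose upper 24 bits are all ones, and OR-ing the remaining bytes in cannot clear them, which yields -120.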
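The effect can be reproduced in isolation with the two low bytes of 5000 (0x88 and 0x13). This is a sketch under stated assumptions: 8-bit bytes, a 32-bit two's-complement int, and two helper names (widen_signed, widen_unsigned) invented for this example. signed char is used explicitly so that the result does not depend on how plain char is defined.

```cpp
#include <cstdint>

// Hypothetical helper: widen a byte the way a signed char is widened.
// The byte's top bit is copied into all the upper bits of the result.
constexpr std::uint32_t widen_signed(unsigned char raw) {
    return static_cast<std::uint32_t>(
        static_cast<int>(static_cast<signed char>(raw)));
}

// Hypothetical helper: widen a byte as an unsigned char.
// The upper bits of the result are always zero.
constexpr std::uint32_t widen_unsigned(unsigned char raw) {
    return static_cast<std::uint32_t>(raw);
}

// 5000 is 0x00001388, so its two low bytes are 0x88 and 0x13.
static_assert(widen_signed(0x88) == 0xFFFFFF88u,
              "high bit set: the upper bits become ones");
static_assert(widen_signed(0x13) == 0x00000013u,
              "high bit clear: the value is unchanged");
static_assert((widen_signed(0x88) | (widen_signed(0x13) << 8)) == 0xFFFFFF88u,
              "the ones from the first byte swallow the second byte");
static_assert((widen_unsigned(0x88) | (widen_unsigned(0x13) << 8)) == 0x00001388u,
              "unsigned bytes combine into 5000");
```

The bit pattern 0xFFFFFF88 read as a 32-bit int is -120, which matches the output in the question. The negative example only appears to work because the ones introduced by sign extension happen to coincide with bits that are already set in -5000.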

The solution is to use an explicitly unsigned data type, i.e. unsigned char. I also suggest using unsigned int (or uint32_t) for the type conversions and temporary storage of the data, and converting only the complete result to plain int.
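Applied to the reconstruction loop from the question, that suggestion might look like the sketch below. The names load_le32 and load_le32_signed are hypothetical, and the code assumes 8-bit bytes and a buffer that holds the value least-significant byte first.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: rebuild a 32-bit value from 4 bytes stored
// least-significant byte first (little-endian buffer layout).
constexpr std::uint32_t load_le32(const char* bytes) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        // Go through unsigned char first so that no sign extension can occur.
        const unsigned char byte = static_cast<unsigned char>(bytes[i]);
        value |= static_cast<std::uint32_t>(byte) << (8 * i);
    }
    return value;
}

// Hypothetical helper: convert only the finished value to a signed type.
// This conversion wraps modulo 2^32 since C++20; before that it is
// implementation-defined, and mainstream compilers wrap as well.
constexpr std::int32_t load_le32_signed(const char* bytes) {
    return static_cast<std::int32_t>(load_le32(bytes));
}
```

With this version, both 5000 and -5000 come back unchanged from their byte representations. The shifts describe the byte order of the buffer rather than that of the machine running the code, so the same function gives the same result on a big-endian host as long as the buffer itself is little-endian.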
