How to analyse contents of binary serialization stream?


Problem description

I'm using binary serialization (BinaryFormatter) as a temporary mechanism to store state information in a file for a relatively complex (game) object structure. The files are coming out much larger than I expect, and my data structure includes recursive references, so I'm wondering whether the BinaryFormatter is actually storing multiple copies of the same objects, whether my basic "number of objects and values I should have" arithmetic is way off-base, or where else the excessive size is coming from.

Searching on Stack Overflow I was able to find the specification for Microsoft's binary remoting format: http://msdn.microsoft.com/en-us/library/cc236844(PROT.10).aspx

What I can't find is any existing viewer that enables you to "peek" into the contents of a BinaryFormatter output file: get object counts and total bytes for different object types in the file, etc.

I feel like this must be my "google-fu" failing me (what little I have) - can anyone help? This must have been done before, right??


UPDATE: I could not find it and got no answers so I put something relatively quick together (link to downloadable project below); I can confirm the BinaryFormatter does not store multiple copies of the same object but it does print quite a lot of metadata to the stream. If you need efficient storage, build your own custom serialization methods.

Solution

Because it may be of interest to someone, I decided to write this post about what the binary format of serialized .NET objects looks like and how we can interpret it correctly.

I have based all my research on the .NET Remoting: Binary Format Data Structure specification.



Example class:

To have a working example, I have created a simple class called A which contains two properties, one string and one integer value, called SomeString and SomeValue.

Class A looks like this:

using System;

// Marked serializable so that BinaryFormatter accepts instances of it.
[Serializable]
public class A
{
    public string SomeString
    {
        get;
        set;
    }

    public int SomeValue
    {
        get;
        set;
    }
}

For the serialization I used the BinaryFormatter of course:

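// Requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
BinaryFormatter bf = new BinaryFormatter();
// Write through a FileStream; a StreamWriter is meant for text, not binary data.
using (FileStream fs = File.Create("test.txt"))
{
    bf.Serialize(fs, new A() { SomeString = "abc", SomeValue = 123 });
}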

As can be seen, I passed a new instance of class A containing abc and 123 as values.



Example result data:

If we look at the serialized result in a hex editor, we see the raw bytes of the stream; the relevant byte sequences are quoted step by step in the sections below.
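A hex editor is not strictly required to look at the bytes. The following is a minimal sketch of a dump helper, not part of the original answer: `HexDump` is a hypothetical name, and the sample array simply repeats the first 17 bytes discussed in the next section instead of reading the real file.

```csharp
using System;
using System.Text;

// Formats bytes as rows of up to 16 two-digit hex values, prefixed with the row offset.
static string HexDump(byte[] data)
{
    var sb = new StringBuilder();
    for (int offset = 0; offset < data.Length; offset += 16)
    {
        int count = Math.Min(16, data.Length - offset);
        sb.Append(offset.ToString("X8")).Append("  ");
        sb.AppendLine(BitConverter.ToString(data, offset, count).Replace('-', ' '));
    }
    return sb.ToString();
}

// Stand-in data: the 17 header bytes quoted in the walkthrough.
// For a real file, something like File.ReadAllBytes("test.txt") could be used instead.
byte[] sample =
{
    0x00, 0x01, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF,
    0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
string dump = HexDump(sample);
Console.Write(dump);
```

The offset column makes it easier to match positions in the dump with the record boundaries described below.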



Let us interpret the example result data:

According to the above mentioned specification (here is the direct link to the PDF: [MS-NRBF].pdf), every record within the stream is identified by the RecordTypeEnumeration. Section 2.1.2.1 RecordTypeEnumeration states:

This enumeration identifies the type of the record. Each record (except for MemberPrimitiveUnTyped) starts with a record type enumeration. The size of the enumeration is one BYTE.



SerializationHeaderRecord:

So if we look back at the data we got, we can start interpreting the first byte:

As stated in 2.1.2.1 RecordTypeEnumeration, a value of 0 identifies the SerializationHeaderRecord, which is specified in 2.6.1 SerializationHeaderRecord:

The SerializationHeaderRecord record MUST be the first record in a binary serialization. This record has the major and minor version of the format and the IDs of the top object and the headers.

It consists of:

  • RecordTypeEnum (1 byte)
  • RootId (4 bytes)
  • HeaderId (4 bytes)
  • MajorVersion (4 bytes)
  • MinorVersion (4 bytes)



With that knowledge we can interpret the record containing 17 bytes:

00 represents the RecordTypeEnumeration which is SerializationHeaderRecord in our case.

01 00 00 00 represents the RootId

If neither the BinaryMethodCall nor BinaryMethodReturn record is present in the serialization stream, the value of this field MUST contain the ObjectId of a Class, Array, or BinaryObjectString record contained in the serialization stream.

So in our case this should be the ObjectId with the value 1 (the bytes 01 00 00 00 read as 1 because the data is serialized in little-endian order), which we will hopefully see again ;-)

FF FF FF FF represents the HeaderId

01 00 00 00 represents the MajorVersion

00 00 00 00 represents the MinorVersion
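The same interpretation can be checked in code. This is a small sketch that only assumes the 17 bytes quoted above; it relies on `BinaryReader`, which reads integers in little-endian order, and the variable names are chosen for illustration.

```csharp
using System;
using System.IO;

// The 17 header bytes quoted in the walkthrough.
byte[] header =
{
    0x00,                   // RecordTypeEnum: SerializationHeaderRecord
    0x01, 0x00, 0x00, 0x00, // RootId
    0xFF, 0xFF, 0xFF, 0xFF, // HeaderId
    0x01, 0x00, 0x00, 0x00, // MajorVersion
    0x00, 0x00, 0x00, 0x00  // MinorVersion
};

// BinaryReader reads multi-byte integers as little-endian values.
var reader = new BinaryReader(new MemoryStream(header));
byte recordType = reader.ReadByte();
int rootId = reader.ReadInt32();
int headerId = reader.ReadInt32();
int majorVersion = reader.ReadInt32();
int minorVersion = reader.ReadInt32();

// Prints: RecordType=0 RootId=1 HeaderId=-1 Version=1.0
Console.WriteLine($"RecordType={recordType} RootId={rootId} HeaderId={headerId} Version={majorVersion}.{minorVersion}");
```

Read as a signed 32-bit integer, FF FF FF FF is -1.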



BinaryLibrary:

As specified, each record must begin with the RecordTypeEnumeration. As the last record is complete, we must assume that a new one begins.

Let us interpret the next byte:

As we can see, in our example the SerializationHeaderRecord is followed by the BinaryLibrary record:

The BinaryLibrary record associates an INT32 ID (as specified in [MS-DTYP] section 2.2.22) with a Library name. This allows other records to reference the Library name by using the ID. This approach reduces the wire size when there are multiple records that reference the same Library name.

It consists of:

  • RecordTypeEnum (1 byte)
  • LibraryId (4 bytes)
  • LibraryName (variable number of bytes (which is a LengthPrefixedString))



As stated in 2.1.1.6 LengthPrefixedString...

The LengthPrefixedString represents a string value. The string is prefixed by the length of the UTF-8 encoded string in bytes. The length is encoded in a variable-length field with a minimum of 1 byte and a maximum of 5 bytes. To minimize the wire size, length is encoded as a variable-length field.
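As a rough illustration of the variable-length field, here is a sketch of a decoder under the usual 7-bit scheme: each byte contributes 7 bits of the length, least significant group first, and a set high bit means that another byte follows. `ReadLength` is a hypothetical helper, and the two-byte input 80 01 is a made-up example that does not occur in the stream discussed here.

```csharp
using System;

// Decodes a length prefix of 1 to 5 bytes starting at data[pos] and advances pos.
static int ReadLength(byte[] data, ref int pos)
{
    int length = 0;
    for (int shift = 0; shift < 35; shift += 7)
    {
        byte b = data[pos++];
        length |= (b & 0x7F) << shift;      // lower 7 bits carry the value
        if ((b & 0x80) == 0) return length; // high bit clear: last byte
    }
    throw new FormatException("Length prefix is longer than 5 bytes.");
}

int shortPos = 0;
int shortLength = ReadLength(new byte[] { 0x42 }, ref shortPos);      // single byte: 66

int longPos = 0;
int longLength = ReadLength(new byte[] { 0x80, 0x01 }, ref longPos);  // two bytes: 128

Console.WriteLine($"{shortLength} {longLength}");
```

Strings shorter than 128 bytes therefore need only a single length byte, which is the case for every string in this example.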

In our simple example the length is always encoded using 1 byte. With that knowledge we can continue the interpretation of the bytes in the stream:

0C represents the RecordTypeEnumeration which identifies the BinaryLibrary record.

02 00 00 00 represents the LibraryId which is 2 in our case.



Now the LengthPrefixedString follows:

42 represents the length information of the LengthPrefixedString which contains the LibraryName.

In our case the length information of 42 (decimal 66) tells us that we need to read the next 66 bytes and interpret them as the LibraryName.

As already stated, the string is UTF-8 encoded, so the result of the bytes above would be something like: _WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
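To double-check the length and the decoding, the record can be rebuilt and read back. This sketch assumes the library name is exactly the string shown above; `BinaryReader.ReadString` is used because it expects the same kind of 7-bit encoded byte length in front of the text.

```csharp
using System;
using System.IO;
using System.Text;

string expectedName = "_WorkSpace_, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null";

// Rebuild the record: 0C, LibraryId 2, length prefix 42, then the UTF-8 name.
var recordBytes = new MemoryStream();
recordBytes.Write(new byte[] { 0x0C, 0x02, 0x00, 0x00, 0x00, 0x42 }, 0, 6);
byte[] nameBytes = Encoding.UTF8.GetBytes(expectedName);
recordBytes.Write(nameBytes, 0, nameBytes.Length);
recordBytes.Position = 0;

var libReader = new BinaryReader(recordBytes, Encoding.UTF8);
byte libRecordType = libReader.ReadByte();    // 0x0C: BinaryLibrary
int libraryId = libReader.ReadInt32();        // 2
string libraryName = libReader.ReadString();  // length prefix + UTF-8 text

Console.WriteLine($"LibraryId={libraryId}, name is {nameBytes.Length} bytes: {libraryName}");
```

The name encodes to 66 bytes, which matches the length byte 42.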



ClassWithMembersAndTypes:

Again, the record is complete so we interpret the RecordTypeEnumeration of the next one:

05 identifies a ClassWithMembersAndTypes record. Section 2.3.2.1 ClassWithMembersAndTypes states:

The ClassWithMembersAndTypes record is the most verbose of the Class records. It contains metadata about Members, including the names and Remoting Types of the Members. It also contains a Library ID that references the Library Name of the Class.

It consists of:

  • RecordTypeEnum (1 byte)
  • ClassInfo (variable number of bytes)
  • MemberTypeInfo (variable number of bytes)
  • LibraryId (4 bytes)



ClassInfo:

As stated in 2.3.1.1 ClassInfo the record consists of:

  • ObjectId (4 bytes)
  • Name (variable number of bytes (which is again a LengthPrefixedString))
  • MemberCount (4 bytes)
  • MemberNames (a sequence of LengthPrefixedStrings where the number of items MUST be equal to the value specified in the MemberCount field)



Back to the raw data, step by step:

01 00 00 00 represents the ObjectId. We've already seen this one, it was specified as the RootId in the SerializationHeaderRecord.

0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 represents the Name of the class, which is encoded as a LengthPrefixedString. As mentioned, in our example the length of the string is encoded in 1 byte, so the first byte 0F specifies that 15 bytes must be read and decoded using UTF-8. The result looks like this: StackOverFlow.A, so obviously I used StackOverFlow as the name of the namespace.

02 00 00 00 represents the MemberCount; it tells us that 2 member names, both encoded as LengthPrefixedStrings, will follow.

Name of the first member:

1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the first MemberName. 1B is again the length of the string, which is 27 bytes, and it results in something like this: <SomeString>k__BackingField.

Name of the second member:

1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 represents the second MemberName, 1A specifies that the string is 26 bytes long. It results in something like this: <SomeValue>k__BackingField.
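The ClassInfo bytes quoted above can be decoded mechanically. The sketch below pastes them in as a hex string and reads the fields in the order given by the specification; the variable names are illustrative and not part of the original answer.

```csharp
using System;
using System.IO;
using System.Linq;

// ClassInfo bytes as quoted above: ObjectId, Name, MemberCount, MemberNames.
string classInfoHex =
    "01 00 00 00 " +
    "0F 53 74 61 63 6B 4F 76 65 72 46 6C 6F 77 2E 41 " +
    "02 00 00 00 " +
    "1B 3C 53 6F 6D 65 53 74 72 69 6E 67 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64 " +
    "1A 3C 53 6F 6D 65 56 61 6C 75 65 3E 6B 5F 5F 42 61 63 6B 69 6E 67 46 69 65 6C 64";

byte[] classInfoBytes = classInfoHex
    .Split(' ', StringSplitOptions.RemoveEmptyEntries)
    .Select(h => Convert.ToByte(h, 16))
    .ToArray();

var classReader = new BinaryReader(new MemoryStream(classInfoBytes));
int objectId = classReader.ReadInt32();       // 4 bytes, little-endian
string className = classReader.ReadString();  // LengthPrefixedString
int memberCount = classReader.ReadInt32();
var memberNames = new string[memberCount];
for (int i = 0; i < memberCount; i++)
    memberNames[i] = classReader.ReadString();

Console.WriteLine($"ObjectId={objectId} Name={className}");
Console.WriteLine(string.Join(", ", memberNames));
```

The member names are the compiler-generated backing fields of the two auto-properties, which explains part of the metadata overhead in the stream.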



MemberTypeInfo:

After the ClassInfo the MemberTypeInfo follows.

Section 2.3.1.2 MemberTypeInfo states that the structure contains:

  • BinaryTypeEnums (variable in length)

A sequence of BinaryTypeEnumeration values that represents the Member Types that are being transferred. The Array MUST:

  • Have the same number of items as the MemberNames field of the ClassInfo structure.

  • Be ordered such that the BinaryTypeEnumeration corresponds to the Member name in the MemberNames field of the ClassInfo structure.

  • AdditionalInfos (variable in length); depending on the BinaryTypeEnum, additional info may or may not be present.

| BinaryTypeEnum | AdditionalInfos          |
|----------------+--------------------------|
| Primitive      | PrimitiveTypeEnumeration |
| String         | None                     |

So taking that into consideration we are almost there... We expect 2 BinaryTypeEnumeration values (because we had 2 members in the MemberNames).



Again, back to the raw data of the complete MemberTypeInfo record:

01 represents the BinaryTypeEnumeration of the first member. According to 2.1.2.2 BinaryTypeEnumeration, we can expect a String, and it is represented using a LengthPrefixedString.

00 represents the BinaryTypeEnumeration of the second member and, again according to the specification, it is a Primitive. As stated above, Primitives are followed by additional information, in this case a PrimitiveTypeEnumeration. That's why we need to read the next byte, which is 08, match it with the table in 2.1.2.3 PrimitiveTypeEnumeration, and be surprised to notice that we can expect an Int32, which is represented by 4 bytes, as stated in some other document about basic datatypes.
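The two-pass layout (all BinaryTypeEnum bytes first, then the additional info for those that need it) can be sketched as follows. Only the values that occur in this example are handled (Primitive = 00, String = 01, Int32 = 08); everything else is deliberately rejected, and the variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;

byte[] memberTypeInfo = { 0x01, 0x00, 0x08 };
int memberTotal = 2; // taken from the MemberCount field of ClassInfo

// Pass 1: one BinaryTypeEnum byte per member.
int cursor = 0;
var binaryTypes = new byte[memberTotal];
for (int i = 0; i < memberTotal; i++)
    binaryTypes[i] = memberTypeInfo[cursor++];

// Pass 2: additional info, present only for types that require it.
var descriptions = new List<string>();
foreach (byte binaryType in binaryTypes)
{
    switch (binaryType)
    {
        case 0x00: // Primitive: one PrimitiveTypeEnumeration byte follows
            byte primitiveType = memberTypeInfo[cursor++];
            descriptions.Add(primitiveType == 0x08
                ? "Primitive(Int32)"
                : $"Primitive(0x{primitiveType:X2})");
            break;
        case 0x01: // String: no additional info
            descriptions.Add("String");
            break;
        default:
            throw new NotSupportedException(
                $"BinaryTypeEnum 0x{binaryType:X2} is not covered by this sketch.");
    }
}

// Prints: String, Primitive(Int32)
Console.WriteLine(string.Join(", ", descriptions));
```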



LibraryId:

After the MemberTypeInfo the LibraryId follows; it is represented by 4 bytes:

02 00 00 00 represents the LibraryId which is 2.



The values:

As specified in 2.3 Class Records:

The values of the Members of the Class MUST be serialized as records that follow this record, as specified in section 2.7. The order of the records MUST match the order of MemberNames as specified in the ClassInfo (section 2.3.1.1) structure.

That's why we can now expect the values of the members.

Let us look at the last few bytes:

06 identifies a BinaryObjectString. It represents the value of our SomeString property (the <SomeString>k__BackingField to be exact).

According to 2.5.7 BinaryObjectString it contains:

  • RecordTypeEnum (1 byte)
  • ObjectId (4 bytes)
  • Value (variable length, represented as a LengthPrefixedString)



So knowing that, we can clearly identify that

03 00 00 00 represents the ObjectId.

03 61 62 63 represents the Value where 03 is the length of the string itself and 61 62 63 are the content bytes that translate to abc.

Hopefully you can remember that there was a second member, an Int32. Knowing that the Int32 is represented by using 4 bytes, we can conclude that

7B 00 00 00

must be the Value of our second member. 7B hexadecimal equals 123 decimal, which seems to fit our example code.
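The two member values can be read back in the same way. In this sketch the last four bytes, 7B 00 00 00, are inferred from the text (an Int32 of 4 bytes in little-endian order whose first byte is 7B); note that this primitive value carries no record type byte of its own.

```csharp
using System;
using System.IO;

byte[] values =
{
    0x06,                   // RecordTypeEnum: BinaryObjectString
    0x03, 0x00, 0x00, 0x00, // ObjectId
    0x03, 0x61, 0x62, 0x63, // LengthPrefixedString "abc"
    0x7B, 0x00, 0x00, 0x00  // Int32 value, written without a record type byte
};

var valueReader = new BinaryReader(new MemoryStream(values));
byte stringRecordType = valueReader.ReadByte();
int stringObjectId = valueReader.ReadInt32();
string someString = valueReader.ReadString();
int someValue = valueReader.ReadInt32();

// Prints: ObjectId=3 SomeString=abc SomeValue=123
Console.WriteLine($"ObjectId={stringObjectId} SomeString={someString} SomeValue={someValue}");
```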

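So the complete ClassWithMembersAndTypes record consists of the record type byte 05, the ClassInfo, the MemberTypeInfo and the LibraryId, followed by the two member values.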



MessageEnd:

Finally, the last byte 0B represents the MessageEnd record.
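Coming back to the original wish for per-type counts: once the record boundaries are known, tallying is straightforward. The sketch below only counts the record type bytes met in this walkthrough, in stream order; it is not a general parser, and the lookup table covers just these five values, while the specification defines more.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Record type values seen in this walkthrough only.
var recordNames = new Dictionary<byte, string>
{
    [0x00] = "SerializationHeaderRecord",
    [0x05] = "ClassWithMembersAndTypes",
    [0x06] = "BinaryObjectString",
    [0x0B] = "MessageEnd",
    [0x0C] = "BinaryLibrary",
};

// The first byte of each record in the example stream, in order of appearance.
byte[] seenRecordTypes = { 0x00, 0x0C, 0x05, 0x06, 0x0B };

var counts = seenRecordTypes
    .GroupBy(t => recordNames.TryGetValue(t, out var name) ? name : $"unknown (0x{t:X2})")
    .ToDictionary(g => g.Key, g => g.Count());

foreach (var pair in counts)
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```

A real viewer would have to parse each record body to find where the next record starts, as done by hand in the sections above.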
