Memory usage serializing chunked byte arrays with Protobuf-net


Question



In our application we have some data structures which amongst other things contain a chunked list of bytes (currently exposed as a List<byte[]>). We chunk bytes up because if we allow the byte arrays to be put on the large object heap then over time we suffer from memory fragmentation.
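
For context, the chunking itself looks something like this (a simplified sketch rather than our actual code, with an 80 KB chunk size chosen to stay below the ~85,000-byte LOH threshold):

const int ChunkSize = 80 * 1024; // comfortably below the ~85,000-byte LOH threshold

static List<byte[]> ReadChunked(Stream source)
{
    var chunks = new List<byte[]>();
    var buffer = new byte[ChunkSize];
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        // copy out only the bytes actually read, so no single array ever reaches the LOH
        var chunk = new byte[read];
        Buffer.BlockCopy(buffer, 0, chunk, 0, read);
        chunks.Add(chunk);
    }
    return chunks;
}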

We've also started using Protobuf-net to serialize these structures, using our own generated serialization DLL.

However we've noticed that Protobuf-net is creating very large in-memory buffers while serializing. Glancing through the source code it appears that perhaps it can't flush its internal buffer until the entire List<byte[]> structure has been written because it needs to write the total length at the front of the buffer afterwards.
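To put a hypothetical number on it: if that analysis is right, then a 100 MB payload that we have carefully chunked into 80 KB arrays still forces protobuf-net to grow a single contiguous internal buffer past 100 MB before anything reaches the stream.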

This unfortunately undoes our work with chunking the bytes in the first place, and eventually gives us OutOfMemoryExceptions due to memory fragmentation (the exception occurs at the point where Protobuf-net is trying to expand the buffer to over 84K, which obviously puts it on the LOH, even though our overall process memory usage is fairly low).

If my analysis of how Protobuf-net is working is correct, is there a way around this issue?


Update

Based on Marc's answer, here is what I've tried:

[ProtoContract]
[ProtoInclude(1, typeof(A), DataFormat = DataFormat.Group)]
public class ABase
{
}

[ProtoContract]
public class A : ABase
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public B B
    {
        get;
        set;
    }
}

[ProtoContract]
public class B
{
    [ProtoMember(1, DataFormat = DataFormat.Group)]
    public List<byte[]> Data
    {
        get;
        set;
    }
}

Then to serialize it:

var a = new A();
var b = new B();
a.B = b;
b.Data = new List<byte[]>
{
    Enumerable.Range(0, 1999).Select(v => (byte)v).ToArray(),
    Enumerable.Range(2000, 3999).Select(v => (byte)v).ToArray(),
};

var stream = new MemoryStream();
Serializer.Serialize(stream, a);

However if I stick a breakpoint in ProtoWriter.WriteBytes() where it calls DemandSpace() towards the bottom of the method and step into DemandSpace(), I can see that the buffer isn't being flushed because writer.flushLock equals 1.
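
One way to watch this from the outside, without breakpoints in protobuf-net internals, is to serialize into a pass-through stream that logs each Write call (a throwaway sketch of my own; with length-prefixing you'd expect one large write at the end, with group encoding a series of smaller ones):

public class LoggingStream : Stream
{
    private readonly Stream inner;
    public LoggingStream(Stream inner) { this.inner = inner; }

    public override void Write(byte[] buffer, int offset, int count)
    {
        Console.WriteLine("Write: {0} bytes", count); // shows when protobuf-net flushes to us
        inner.Write(buffer, offset, count);
    }

    // boilerplate pass-throughs
    public override bool CanRead { get { return inner.CanRead; } }
    public override bool CanSeek { get { return inner.CanSeek; } }
    public override bool CanWrite { get { return inner.CanWrite; } }
    public override long Length { get { return inner.Length; } }
    public override long Position
    {
        get { return inner.Position; }
        set { inner.Position = value; }
    }
    public override void Flush() { inner.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { return inner.Read(buffer, offset, count); }
    public override long Seek(long offset, SeekOrigin origin) { return inner.Seek(offset, origin); }
    public override void SetLength(long value) { inner.SetLength(value); }
}

// usage: Serializer.Serialize(new LoggingStream(new MemoryStream()), a);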

If I create another base class for ABase like this:

[ProtoContract]
[ProtoInclude(1, typeof(ABase), DataFormat = DataFormat.Group)]
public class ABaseBase
{
}

[ProtoContract]
[ProtoInclude(1, typeof(A), DataFormat = DataFormat.Group)]
public class ABase : ABaseBase
{
}

Then writer.flushLock equals 2 in DemandSpace().

I'm guessing there is an obvious step I've missed here to do with derived types?

Solution

I'm going to read between some lines here... because List<T> (mapped as repeated in protobuf parlance) doesn't have an overall length-prefix, and byte[] (mapped as bytes) has a trivial length-prefix that shouldn't cause additional buffering. So I'm guessing what you actually have is more like:

[ProtoContract]
public class A {
    [ProtoMember(1)]
    public B Foo {get;set;}
}
[ProtoContract]
public class B {
    [ProtoMember(1)]
    public List<byte[]> Bar {get;set;}
}

Here, the need to buffer for a length-prefix is actually when writing A.Foo, basically to declare "the following complex data is the value for A.Foo". Fortunately there is a simple fix:

[ProtoMember(1, DataFormat=DataFormat.Group)]
public B Foo {get;set;}

This changes between 2 packing techniques in protobuf:

  • the default (google's stated preference) is length-prefixed, meaning you get a marker indicating the length of the message to follow, then the sub-message payload
  • but there is also an option to use a start-marker, the sub-message payload, and an end-marker

When using the second technique it doesn't need to buffer, so: it doesn't. This does mean it will be writing slightly different bytes for the same data, but protobuf-net is very forgiving, and will happily deserialize data from either format here. Meaning: if you make this change, you can still read your existing data, but new data will use the start/end-marker technique.
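
For concreteness, here is the byte-level difference for a hypothetical field 1 whose sub-message payload is the two bytes 08 2A (an inner field 1 holding the varint 42):

// length-prefixed (wire type 2):
//   0A 02 08 2A      tag 0x0A = (1 << 3) | 2, then the length (2), then the payload
// grouped (wire types 3 and 4):
//   0B 08 2A 0C      start tag 0x0B = (1 << 3) | 3, payload, end tag 0x0C = (1 << 3) | 4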

This raises the question: why does Google prefer the length-prefix approach? Probably because it is more efficient, when reading, to skip through fields (either via a raw reader API, or as unwanted/unexpected data): with a length prefix you can just read the length and then progress the stream [n] bytes; by contrast, to skip data with a start/end-marker you still need to crawl through the payload, skipping the sub-fields individually. Of course, this theoretical difference in read performance doesn't apply if you expect the data and want to read it into your object, which you almost certainly do. Also, in the Google protobuf implementation, because it isn't working with a regular POCO model, the sizes of the payloads are already known, so they don't really see the same issue when writing.
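
A rough sketch of what that skip logic looks like on the reading side (hand-rolled; ReadVarint, SkipField and endGroupTag are hypothetical helpers, not protobuf-net APIs):

// length-prefixed: skipping is O(1) in the payload size
int length = ReadVarint(stream);
stream.Seek(length, SeekOrigin.Current);

// grouped: must crawl through every nested field until the end-marker
int tag;
while ((tag = ReadVarint(stream)) != endGroupTag)
{
    SkipField(stream, tag); // may itself recurse for nested groups
}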
