Google protocol buffers and use of std::string for arbitrary binary data

Question

Related question: vector<unsigned char> vs string for binary data.

My code uses vector<unsigned char> for arbitrary binary data. However, much of it has to interface with Google's protocol buffers code, which uses std::string for arbitrary binary data. This leads to a lot of ugly allocate/copy/free cycles just to move data between protocol buffers and my code. It also leaves me with many cases where I need two constructors (one taking a vector, one taking a string) or two functions to convert an object to binary wire format.

The code deals with raw structures a lot internally because structures are content-addressable (stored and retrieved by hash), signed, and so on. So it is not just a matter of the interface to Google's protocol buffers: objects are handled in raw form in other parts of the code as well.

One thing I could do is cut all my code over to std::string for arbitrary binary data. Another is to work out more efficient ways to store my vectors into, and retrieve them from, Google protocol buffer objects. My last option, I guess, would be to write standard, simple, but slow conversion functions to and from strings and always use them. That would avoid the rampant code duplication, but it would be the worst choice from a performance standpoint.

Any suggestions? Any better choices I'm missing?

This is the kind of thing I'm trying to avoid:

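if (SomeCase)
{
    // Allocate a vector only to hold a copy of bytes the protobuf object already owns.
    std::vector<unsigned char> rawObject(objectdata().size());
    if (!rawObject.empty())  // data() of an empty vector must not be passed to memcpy
        memcpy(rawObject.data(), objectdata().data(), objectdata().size());
    DoSomethingWith(rawObject);
}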

The allocate, copy, process, free cycle is completely senseless when the raw data is already sitting there.

Recommended answer

There are two ways to avoid copying that I know of and have seen in use.

The traditional way is indeed to pass a pointer/reference to a known entity. This works fine and with a minimum of fuss, but it ties you to a given representation, which entails conversions (as you experienced) whenever another representation is needed.

The other way I discovered through LLVM:

  • ArrayRef
  • StringRef

The idea is amazingly simple: both hold a T* pointing to the start of an array of T and a size_t indicating the number of elements.

What is magical is that they completely hide the actual storage: be it a string, a vector, or a dynamically or statically allocated C array, it does not matter. The interface presented is completely uniform and no copy is involved.
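To make the idea concrete, here is a minimal sketch of such a view for bytes. The ByteRef class and the sumBytes function are hypothetical names invented for illustration; this is not LLVM's actual ArrayRef or StringRef, only the same pointer-plus-size shape.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-owning view over a run of bytes (not LLVM's class).
class ByteRef {
public:
    ByteRef(const unsigned char* data, std::size_t size)
        : data_(data), size_(size) {}

    // View the bytes held by a std::string.
    ByteRef(const std::string& s)
        : data_(reinterpret_cast<const unsigned char*>(s.data())),
          size_(s.size()) {}

    // View the bytes held by a std::vector<unsigned char>.
    ByteRef(const std::vector<unsigned char>& v)
        : data_(v.data()), size_(v.size()) {}

    // View a statically sized C array.
    template <std::size_t N>
    ByteRef(const unsigned char (&a)[N]) : data_(a), size_(N) {}

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }
    const unsigned char* begin() const { return data_; }
    const unsigned char* end() const { return data_ + size_; }

    unsigned char operator[](std::size_t i) const {
        assert(i < size_ && "index out of range");
        return data_[i];
    }

private:
    const unsigned char* data_;  // not owned
    std::size_t size_;
};

// One function serves every storage type; no bytes are copied.
inline unsigned sumBytes(ByteRef bytes) {
    unsigned total = 0;
    for (unsigned char b : bytes) total += b;
    return total;
}
```

With this in place, a function such as sumBytes can be called with a string, a vector, or a C array directly, so the duplicated constructors and overloads described in the question would collapse into one signature.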

The only caveat is that they do not take ownership of the memory (Ref!), so subtle bugs might creep in if you do not take care. Still, it is usually fine as long as you use them only in transient operations (within a function, for example) and do not store them for later use.

I have found them incredibly handy for buffer manipulation, especially thanks to the free slicing operations. Ranges are just so much easier to manipulate than pairs of iterators.
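As a rough illustration of why slicing is cheap, the sketch below uses a hypothetical ByteRange struct; dropFront, takeFront, slice and readPayload are names made up here, and the [1-byte length][payload][rest] layout is an assumption chosen only for the example. Every operation just adjusts a pointer and a length.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning byte range: a pointer plus a length.
struct ByteRange {
    const unsigned char* data;
    std::size_t size;

    // Drop the first n bytes; O(1), no copy.
    ByteRange dropFront(std::size_t n) const {
        assert(n <= size && "cannot drop more than the range holds");
        return ByteRange{data + n, size - n};
    }

    // Keep only the first n bytes; O(1), no copy.
    ByteRange takeFront(std::size_t n) const {
        assert(n <= size && "cannot take more than the range holds");
        return ByteRange{data, n};
    }

    // Sub-range [offset, offset + count).
    ByteRange slice(std::size_t offset, std::size_t count) const {
        return dropFront(offset).takeFront(count);
    }
};

// Assumed layout for illustration: [1-byte length][payload][rest].
// Returns the payload and stores whatever follows it in `rest`.
inline ByteRange readPayload(ByteRange in, ByteRange& rest) {
    assert(in.size >= 1 && "need at least the length byte");
    std::size_t len = in.data[0];
    ByteRange payload = in.slice(1, len);
    rest = in.dropFront(1 + len);
    return payload;
}
```

The returned ranges point into the original buffer, so they remain valid only as long as that buffer does.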

There is also a third way that I have tried, but never used in serious code so far. The idea is that a vector<unsigned char> is a very low-level representation. By raising the level of abstraction and using, say, a Buffer class, you can completely encapsulate the exact way the memory is stored, so that it becomes a non-issue as far as your code is concerned.

And then, feel free to choose the memory representation that requires the fewest conversions.
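A minimal sketch of what such a class might look like follows. The Buffer interface and its release member are hypothetical, and backing it with std::string is an assumption made here because that is the type the protocol buffers side expects; callers never see the storage type, so it could be changed later.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical owning byte buffer that hides its storage type.
class Buffer {
public:
    Buffer() = default;

    // Adopt an existing string by moving it in rather than deep-copying.
    explicit Buffer(std::string bytes) : storage_(std::move(bytes)) {}

    // Copy from raw memory when the bytes live elsewhere.
    Buffer(const unsigned char* data, std::size_t size)
        : storage_(reinterpret_cast<const char*>(data), size) {}

    std::size_t size() const { return storage_.size(); }

    const unsigned char* data() const {
        return reinterpret_cast<const unsigned char*>(storage_.data());
    }

    unsigned char operator[](std::size_t i) const {
        assert(i < storage_.size() && "index out of range");
        return static_cast<unsigned char>(storage_[i]);
    }

    // Hand the bytes to an API that wants a std::string,
    // leaving this buffer empty.
    std::string release() {
        std::string out = std::move(storage_);
        storage_.clear();
        return out;
    }

private:
    // Assumption: std::string storage, to match a std::string-based API.
    std::string storage_;
};
```

Code that hashes, signs, or stores objects would then work against Buffer alone, and only the boundary with the std::string-based API would call release or the adopting constructor.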
