Delphi Unicode String Length in Bytes


Question


I'm porting some Delphi 7 code to XE4, so Unicode is the subject here.

I have a method that writes a string to a TMemoryStream. According to this Embarcadero article, I should multiply the length of the string (in characters) by the size of the Char type to get the byte count that WriteBuffer expects as its length parameter.

So, before:

rawHtml : string; // AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));

After:

rawHtml : string; // UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner case foreign characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicions: "In fact, it isn’t even always true that one Char is equal to two bytes!"

So that leaves me wondering whether Length(rawHtml) * SizeOf(Char) is robust enough to be consistently accurate, or whether there is a more accurate way to determine the size of the string in bytes.

Solution

"My understanding of Delphi's UnicodeString type is that it's UTF-16 internally."

You are correct about the UTF-16 encoding of Delphi's UnicodeString. This means that one 16-bit Char is wide enough to represent any code point from the Basic Multilingual Plane as exactly one Char element of the string.

"But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, that some corner case foreign characters will take 4 bytes."

However, there is a small misconception here. The Length function does not inspect the characters; it simply returns the number of 16-bit WideChar elements, without taking any surrogate pairs into account. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.

program Egyptian;

{$APPTYPE CONSOLE}

var
  S: UnicodeString;

begin
  S := #$1304E;  // one code point (U+1304E), stored as a surrogate pair
  Writeln(Length(S));  // counts 16-bit Char elements, so this prints 2
  Readln;
end.
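
To make the two Char elements visible, here is a minimal sketch that writes the same code point as an explicit surrogate pair and dumps each element. The program name is made up for illustration, and it assumes a Unicode-enabled Delphi (2009 or later) where SizeOf(Char) is 2. The values $D80C and $DC4E are the UTF-16 encoding of U+1304E.

```delphi
program SurrogateInspect;  // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: UnicodeString;
  C: Char;

begin
  // U+1304E written out as its UTF-16 surrogate pair (high, then low).
  S := #$D80C#$DC4E;

  // Length counts Char elements, not code points.
  Writeln('Length in Char elements: ', Length(S));

  // Each element is one 16-bit code unit.
  for C in S do
    Writeln('  element = $', IntToHex(Ord(C), 4));

  // The byte size follows directly from the element count.
  Writeln('Size in bytes: ', Length(S) * SizeOf(Char));
  Readln;
end.
```

Under those assumptions the program should report 2 elements ($D80C and $DC4E) and 4 bytes. The "4 bytes per character" case is therefore already included in what Length returns.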


Conclusion: byte size of string data is always fixed and equals Length(S) * SizeOf(Char), no matter if S contains any variable-length characters.
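
Applied to the stream from the question, the conclusion might look like the sketch below. It is an illustration, not the asker's actual code: the program name, the sample string, and the round-trip check are all assumptions. It writes the raw UTF-16 data, then reads it back to confirm that nothing was cut off.

```delphi
program StreamByteLength;  // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

var
  RawHtml: string;      // UnicodeString in Delphi 2009 and later
  RoundTrip: string;
  Stream: TMemoryStream;
  ByteCount: Integer;

begin
  // Made-up sample: ASCII markup around U+1304E (a surrogate pair).
  RawHtml := '<p>' + #$D80C#$DC4E + '</p>';
  ByteCount := Length(RawHtml) * SizeOf(Char);

  Stream := TMemoryStream.Create;
  try
    // Pointer(RawHtml) is nil for an empty string, so skip the write then.
    if ByteCount > 0 then
      Stream.WriteBuffer(Pointer(RawHtml)^, ByteCount);
    Writeln('Chars: ', Length(RawHtml), ', bytes written: ', Stream.Size);

    // Read the raw UTF-16 data back into a string of the same length.
    SetLength(RoundTrip, ByteCount div SizeOf(Char));
    Stream.Position := 0;
    if ByteCount > 0 then
      Stream.ReadBuffer(Pointer(RoundTrip)^, ByteCount);
    Writeln('Round trip matches: ', RoundTrip = RawHtml);
  finally
    Stream.Free;
  end;
  Readln;
end.
```

With this sample the string has 9 Char elements, so 18 bytes go into the stream. As far as I know, SysUtils also offers a ByteLength function that returns the same product for a string, which may read more clearly at the call site. Note that this applies only to the raw UTF-16 data: if the stream is meant to hold UTF-8 or another encoding, the byte count has to come from the encoded bytes instead.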
