Converting UnicodeString to AnsiString
Problem description
In the olden times, I had a function that would convert a WideString to an AnsiString of the specified code-page:
function WideStringToString(const Source: WideString; CodePage: UINT): AnsiString;
...
begin
  ...
  // Convert source UTF-16 string (WideString) to the destination using the code-page
  strLen := WideCharToMultiByte(CodePage, 0,
    PWideChar(Source), Length(Source), // Source
    PAnsiChar(cpStr), strLen,          // Destination
    nil, nil);
  ...
end;
And everything worked. I passed the function a Unicode string (i.e. UTF-16 encoded data) and converted it to an AnsiString, with the understanding that the bytes in the AnsiString represented characters from the specified code-page.
For example:
TUnicodeHelper.WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
would return the Windows-1252 encoded string:
The qùíçk brown fôx jumped ovêr the lázÿ dog
Note: Information was of course lost during the conversion from the full Unicode character set to the limited confines of the Windows-1252 code page:
Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ (before)
The qùíçk brown fôx jumped ovêr the lázÿ dog (after)
But the Windows WideCharToMultiByte function does a pretty good job of best-fit mapping, as it is designed to do.
Now the after times
Now we are in the after times. WideString is now a pariah, with UnicodeString being the goodness. It's an inconsequential change, as the Windows function only needed a pointer to a series of WideChar anyway (which a UnicodeString also is). So we change the declaration to use UnicodeString instead:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
begin
  ...
end;
Now we come to the return value. I have an AnsiString that contains the bytes:
54 68 65 20 71 F9 ED E7 The qùíç
6B 20 62 72 6F 77 6E 20 k brown
66 F4 78 20 6A 75 6D 70 fôx jump
65 64 20 6F 76 EA 72 20 ed ovêr
74 68 65 20 6C E1 7A FF the lázÿ
20 64 6F 67 dog
In the olden times that was fine. I kept track of what code-page the AnsiString actually contained; I had to remember that the returned AnsiString was not encoded using the computer's locale (e.g. Windows 1258), but was instead encoded using another code-page (the one passed in as CodePage).
But in Delphi XE6 an AnsiString also secretly contains the codepage:
- codePage: 1258
- length: 44
- value: The qùíçk brown fôx jumped ovêr the lázÿ dog
This code-page is wrong. Delphi is specifying the code-page of my computer, rather than the code-page that the string is actually encoded in. Technically this is not a problem: I always understood that the AnsiString was in a particular code-page, I just had to be sure to pass that information along.
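The hidden code-page tag can be inspected with the RTL's StringCodePage function. The following is a minimal sketch, assuming a Unicode-enabled Delphi (2009 or later); the program name and variable names are made up for illustration:

```delphi
program ShowCodePage;
{$APPTYPE CONSOLE}

var
  s: UnicodeString;
  a: AnsiString;
  u: UTF8String;
begin
  s := 'The quick brown fox';
  a := AnsiString(s);  // runtime conversion to the system ANSI code page
  u := UTF8String(s);  // runtime conversion to UTF-8

  // The tag on a plain AnsiString follows the machine, not the data's origin.
  Writeln('AnsiString tag: ', StringCodePage(a));
  Writeln('System default: ', DefaultSystemCodePage);
  // A UTF8String is tagged 65001 (CP_UTF8).
  Writeln('UTF8String tag: ', StringCodePage(u));
end.
```

On a machine configured like the one described here, the first two lines would both report 1258, regardless of which code-page a helper function actually used to produce the bytes.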
So when I wanted to decode the string, I had to pass along the code-page with it:
s := TUnicodeHelper.StringToWideString(s, 1252);
with
function StringToWideString(s: AnsiString; CodePage: UINT): UnicodeString;
begin
  ...
  MultiByteToWideChar(...);
  ...
end;
Then one person screws everything up
The problem was that in the olden times I declared a type called Utf8String:
type
  Utf8String = type AnsiString;
Because it was common enough to have:
function TUnicodeHelper.WideStringToUtf8(const s: UnicodeString): Utf8String;
begin
  Result := WideStringToString(s, CP_UTF8);
end;
and the reverse:
function TUnicodeHelper.Utf8ToWideString(const s: Utf8String): UnicodeString;
begin
  Result := StringToWideString(s, CP_UTF8);
end;
Now in XE6 I have a function that takes a Utf8String. If some existing code somewhere were to take a UTF-8 encoded AnsiString and try to convert it to UnicodeString using Utf8ToWideString, it would fail:
s: AnsiString;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
...
ws: UnicodeString;
ws := Utf8ToWideString(s); // Delphi will treat s as CP1252, and convert it to UTF-8
Or worse is the breadth of existing code that does:
s: Utf8String;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);
The returned string will become totally mangled:
- the function returns AnsiString(1252) (an AnsiString tagged as encoded using the current codepage)
- the return result is being stored in an AnsiString(65001) string (Utf8String)
- Delphi converts the UTF-8 encoded string into UTF-8 as though it was 1252.
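The mangling described in the list above can be reproduced in isolation. This is a hedged sketch rather than the original helper code: it relabels UTF-8 bytes as 1252 with SetCodePage(..., False) to imitate a function that returns a wrongly tagged string, then lets the RTL convert. The program and variable names are hypothetical, and Windows-1252 is assumed to be installed.

```delphi
program DoubleEncodingDemo;
{$APPTYPE CONSOLE}

var
  s, decoded: UnicodeString;
  raw: RawByteString;
  mangled: UTF8String;
begin
  s := WideChar($00E9);            // one character: U+00E9 (e with acute accent)
  raw := UTF8Encode(s);            // two UTF-8 bytes: C3 A9, tagged 65001

  // Imitate a wrongly tagged result: change the label only, leave the bytes alone.
  SetCodePage(raw, 1252, False);

  // The RTL trusts the tag, so each UTF-8 byte is decoded as a separate
  // Windows-1252 character (C3 -> U+00C3, A9 -> U+00A9).
  decoded := UnicodeString(raw);
  mangled := UTF8String(decoded);  // re-encoding yields C3 83 C2 A9

  Writeln('characters before: ', Length(s));
  Writeln('characters after:  ', Length(decoded));
  Writeln('UTF-8 bytes after: ', Length(mangled));
end.
```

With a correct tag the round trip would give back one character; with the wrong tag it should give two characters and four UTF-8 bytes, which is the double encoding the question is worried about.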
How to move forward
Ideally my UnicodeStringToString(string, codePage) function (which returns an AnsiString) could set the CodePage inside the string to match the actual code-page using something like SetCodePage:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): AnsiString;
begin
  ...
  WideCharToMultiByte(...);
  ...
  // Adjust the codepage contained in the AnsiString to match reality
  // SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
  if Length(Result) > 0 then
    PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Except that manually mucking around with the internal structure of an AnsiString is horribly dangerous.
So what about returning RawByteString?
It has been said, over and over, by a lot of people who aren't me, that RawByteString is meant to be the universal recipient; it wasn't meant to be used as a return type:
function UnicodeStringToString(s: UnicodeString; CodePage: UINT): RawByteString;
begin
  ...
  WideCharToMultiByte(...);
  ...
  // Adjust the codepage contained in the AnsiString to match reality
  SetCodePage(Result, CodePage, False); // SetCodePage only works on RawByteString
end;
This has the virtue of being able to use the supported and documented SetCodePage.
But if we're going to cross a line, and start returning RawByteString, surely Delphi already has a function that can convert a UnicodeString to a RawByteString and vice versa:
function WideStringToString(const s: UnicodeString; CodePage: UINT): RawByteString;
begin
  Result := SysUtils.Something(s, CodePage);
end;
function StringToWideString(const s: RawByteString; CodePage: UINT): UnicodeString;
begin
  Result := SysUtils.SomethingElse(s, CodePage);
end;
But what is it?
Or what else should I do?
This was a long-winded set of background for a trivial question. The real question is, of course, what should I be doing instead? There is a lot of code out there that depends on UnicodeStringToString and the reverse.
tl;dr:
I can convert a UnicodeString to UTF-8 by doing:
Utf8Encode('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
and I can convert a UnicodeString to the current code-page by using:
AnsiString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');
But how do I convert a UnicodeString to an arbitrary (unspecified) code-page?
My feeling is that since everything really is an AnsiString:
Utf8String = AnsiString(65001);
RawByteString = AnsiString(65535);
I should bite the bullet, bust open the AnsiString structure, and poke the correct code-page into it:
function StringToAnsi(const s: UnicodeString; CodePage: UINT): AnsiString;
begin
  LocaleCharsFromUnicode(CodePage, ..., s, ...);
  ...
  if Length(Result) > 0 then
    PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;
Then the rest of the VCL will fall in line.
In this particular case, using RawByteString is an appropriate solution:
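function WideStringToString(const Source: UnicodeString; CodePage: UINT): RawByteString;
var
  strLen: Integer;
begin
  Result := ''; // a managed Result is not guaranteed to start out empty
  // First call: ask how many bytes the converted string needs
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    // Second call: perform the conversion into the allocated buffer
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    // Tag the bytes with their real code page without converting them
    SetCodePage(Result, CodePage, False);
  end;
end;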
This way, the RawByteString holds the codepage, and assigning the RawByteString to any other string type, whether that be AnsiString or UTF8String or whatever, will allow the RTL to automatically convert the RawByteString data from its current codepage to the destination string's codepage (which includes conversions to UnicodeString).
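As a quick check of the tagging behaviour, the sketch below repeats the function so that it compiles on its own and then inspects the result. It assumes Delphi XE or later and uses Cardinal instead of UINT to avoid pulling in the Windows unit; the program and variable names are illustrative only.

```delphi
program RawByteRoundTrip;
{$APPTYPE CONSOLE}

// Same approach as the function above, repeated to keep the sketch self-contained.
function WideStringToString(const Source: UnicodeString; CodePage: Cardinal): RawByteString;
var
  strLen: Integer;
begin
  Result := '';
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source),
    nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source),
      PAnsiChar(Result), strLen, nil, nil);
    SetCodePage(Result, Word(CodePage), False); // relabel only, no conversion
  end;
end;

var
  r: RawByteString;
  back: UnicodeString;
begin
  r := WideStringToString('The quick brown fox', 1252);
  Writeln('tag: ', StringCodePage(r));  // the tag now reports 1252

  // Converting to UnicodeString uses the tag stored in the string.
  back := UnicodeString(r);
  Writeln('round trip ok: ', back = 'The quick brown fox');
end.
```

The input here is plain ASCII, so the round trip is lossless; text outside Windows-1252 would go through the same best-fit or substitution behaviour described in the question.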
If you absolutely must return an AnsiString (which I do not recommend), you can still use SetCodePage() via a typecast:
function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
var
  strLen: Integer;
begin
  Result := ''; // a managed Result is not guaranteed to start out empty
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    // Reinterpret the AnsiString as a RawByteString so SetCodePage accepts it
    SetCodePage(PRawByteString(@Result)^, CodePage, False);
  end;
end;
The reverse is much easier: just use the codepage already stored in the (Ansi|RawByte)String (making sure those codepages are always accurate), since the RTL already knows how to retrieve and use the codepage for you:
function StringToWideString(const Source: AnsiString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;

function StringToWideString(const Source: RawByteString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;
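For legacy callers whose strings may still carry an inaccurate tag, an explicit code-page parameter can be kept. The question elides the body of its two-argument StringToWideString, so the following is only a sketch of what such a function might look like, using the RTL's UnicodeFromLocaleChars (the counterpart of LocaleCharsFromUnicode). Cardinal is used instead of UINT so no Windows unit is needed, and the demo values are made up.

```delphi
program ExplicitDecode;
{$APPTYPE CONSOLE}

// Decodes Source using the given code page and ignores whatever tag the string carries.
function StringToWideString(const Source: RawByteString; CodePage: Cardinal): UnicodeString;
var
  len: Integer;
begin
  Result := '';
  if Length(Source) = 0 then
    Exit;
  // First call: ask how many UTF-16 code units are needed
  len := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
  if len > 0 then
  begin
    SetLength(Result, len);
    // Second call: decode into the allocated buffer
    UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source),
      PWideChar(Result), len);
  end;
end;

var
  bytes: RawByteString;
  w: UnicodeString;
begin
  SetLength(bytes, 1);
  bytes[1] := AnsiChar($E9);            // byte E9 is U+00E9 in Windows-1252
  w := StringToWideString(bytes, 1252);
  Writeln(Ord(w[1]));                   // 233 ($00E9)
end.
```

Because the tag is never consulted, this variant behaves the same whether or not the caller's string was labelled correctly, which may make it a safer bridge while older code is being migrated.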
That being said, I would suggest dropping the helper functions altogether and just using typed strings instead. Let the RTL handle conversions for you:
type
  Win1252String = type AnsiString(1252);
var
  s: UnicodeString;
  a: Win1252String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  a := Win1252String(s);
  s := UnicodeString(a);
end;

var
  s: UnicodeString;
  u: UTF8String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  u := UTF8String(s);
  s := UnicodeString(u);
end;