Converting UnicodeString to AnsiString


Problem description

In the olden times, I had a function that would convert a WideString to an AnsiString of the specified code-page:

function WideStringToString(const Source: WideString; CodePage: UINT): AnsiString;
...
begin
   ...
    // Convert source UTF-16 string (WideString) to the destination using the code-page
    strLen := WideCharToMultiByte(CodePage, 0,
        PWideChar(Source), Length(Source), //Source
        PAnsiChar(cpStr), strLen, //Destination
        nil, nil);
    ...
end;

And everything worked. I passed the function a Unicode string (i.e. UTF-16 encoded data) and converted it to an AnsiString, with the understanding that the bytes in the AnsiString represented characters from the specified code-page.

For example:

TUnicodeHelper.WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);

would return the Windows-1252 encoded string:

The qùíçk brown fôx jumped ovêr the lázÿ dog

Note: Information was of course lost during the conversion from the full Unicode character set to the limited confines of the Windows-1252 code page:

  • Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ (before)
  • The qùíçk brown fôx jumped ovêr the lázÿ dog (after)

But the Windows WideCharToMultiByte function does a pretty good job of best-fit mapping, as it is designed to do.

Now the after times

Now we are in the after times. WideString is now a pariah, with UnicodeString being the goodness. It's an inconsequential change, as the Windows function only needed a pointer to a series of WideChar anyway (which a UnicodeString also is). So we change the declaration to use UnicodeString instead:

function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
begin
   ...
end;

Now we come to the return value. I have an AnsiString that contains the bytes:

54 68 65 20 71 F9 ED E7  The qùíç
6B 20 62 72 6F 77 6E 20  k brown 
66 F4 78 20 6A 75 6D 70  fôx jump
65 64 20 6F 76 EA 72 20  ed ovêr 
74 68 65 20 6C E1 7A FF  the lázÿ
20 64 6F 67               dog

In the olden times that was fine. I kept track of what code-page the AnsiString actually contained; I had to remember that the returned AnsiString was not encoded using the computer's locale (e.g. Windows 1258), but was instead encoded using another code-page (the one passed in CodePage).

But in Delphi XE6 an AnsiString also secretly contains the codepage:

  • codePage: 1258
  • length: 44
  • value: The qùíçk brown fôx jumped ovêr the lázÿ dog

This code-page is wrong. Delphi is specifying the code-page of my computer, rather than the code-page the string is actually encoded in. Technically this is not a problem: I always understood that the AnsiString was in a particular code-page, I just had to be sure to pass that information along.

So when I wanted to decode the string, I had to pass along the code-page with it:

s := TUnicodeHelper.StringToWideString(s, 1252);

with

function StringToWideString(s: AnsiString; CodePage: UINT): UnicodeString;
begin
   ...
   MultiByteToWideChar(...);
   ...
end;

Then one person screws everything up

The problem was that in the olden times I declared a type called Utf8String:

type
   Utf8String = type AnsiString;

Because it was common enough to have:

function TUnicodeHelper.WideStringToUtf8(const s: UnicodeString): Utf8String;
begin
   Result := WideStringToString(s, CP_UTF8);
end;

and the reverse:

function TUnicodeHelper.Utf8ToWideString(const s: Utf8String): UnicodeString;
begin
   Result := StringToWideString(s, CP_UTF8);
end;

Now in XE6 I have a function that takes a Utf8String. If some existing code somewhere were to take a UTF-8 encoded AnsiString and try to convert it to UnicodeString using Utf8ToWideString, it would fail:

s: AnsiString;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);

...

ws: UnicodeString;
ws := Utf8ToWideString(s); //Delphi will treat s as CP1252, and convert it to UTF-8

Or worse, is the breadth of existing code that does:

s: Utf8String;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);

The returned string will become totally mangled:

  • the function returns AnsiString(1252) (AnsiString tagged as encoded using the current codepage)
  • the return result is being stored in an AnsiString(65001) string (Utf8String)
  • Delphi converts the UTF-8 encoded string into UTF-8 as though it was 1252.
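The double encoding described in the list above can be reproduced at the byte level. What follows is a minimal sketch rather than the code from the question: it uses TEncoding from System.SysUtils instead of the helper functions, and the program name, the DumpBytes helper and the sample text 'fôx' are made up for illustration.

```pascal
program DoubleEncodeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: print a byte array as space-separated hex pairs.
procedure DumpBytes(const Caption: string; const Bytes: TBytes);
var
  i: Integer;
begin
  Write(Caption, ': ');
  for i := 0 to High(Bytes) do
    Write(IntToHex(Bytes[i], 2), ' ');
  Writeln;
end;

var
  cp1252: TEncoding;
  utf8Bytes, mangled: TBytes;
  misread: string;
begin
  // 'fôx', written with a character code so the source file encoding does not matter.
  utf8Bytes := TEncoding.UTF8.GetBytes('f'#$00F4'x');
  DumpBytes('correct UTF-8', utf8Bytes);   // bytes 66 C3 B4 78

  // Misinterpret the UTF-8 bytes as Windows-1252 text...
  cp1252 := TEncoding.GetEncoding(1252);
  try
    misread := cp1252.GetString(utf8Bytes);
  finally
    cp1252.Free; // encodings obtained from GetEncoding must be freed by the caller
  end;

  // ...and encode that misread text as UTF-8 again.
  mangled := TEncoding.UTF8.GetBytes(misread);
  DumpBytes('double-encoded', mangled);    // bytes 66 C3 83 C2 B4 78
end.
```

The single character ô (C3 B4 in UTF-8) is read as the two Windows-1252 characters Ã and ´, each of which then becomes a two-byte UTF-8 sequence. That is the kind of damage a wrongly tagged string would suffer on assignment.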

How to move forward

Ideally my UnicodeStringToString(string, codePage) function (which returns an AnsiString) could set the CodePage inside the string to match the actual code-page using something like SetCodePage:

function UnicodeStringToString(s: UnicodeString; CodePage: UINT): AnsiString;
begin
   ...
   WideCharToMultiByte(...);
   ...

   //Adjust the codepage contained in the AnsiString to match reality
   //SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
   if Length(Result) > 0 then
      PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;

Except that manually mucking around with the internal structure of an AnsiString is horribly dangerous.

So what about returning RawByteString?

It has been said, over and over, by a lot of people who aren't me, that RawByteString is meant to be the universal recipient; it wasn't meant to be used as a return type:

function UnicodeStringToString(s: UnicodeString; CodePage: UINT): RawByteString;
begin
   ...
   WideCharToMultiByte(...);
   ...

   //Adjust the codepage contained in the AnsiString to match reality
   SetCodePage(Result, CodePage, False); //SetCodePage only works on RawByteString
end;

This has the virtue of being able to use the supported and documented SetCodePage.

But if we're going to cross a line, and start returning RawByteString, surely Delphi already has a function that can convert a UnicodeString to a RawByteString string and vice versa:

function WideStringToString(const s: UnicodeString; CodePage: UINT): RawByteString;
begin
   Result := SysUtils.Something(s, CodePage);
end;

function StringToWideString(const s: RawByteString; CodePage: UINT): UnicodeString;
begin
   Result := SysUtils.SomethingElse(s, CodePage);       
end;

But what is it?

Or what else should I do?

This was a long-winded set of background for a trivial question. The real question is, of course, what should I be doing instead? There is a lot of code out there that depends on UnicodeStringToString and the reverse.

tl;dr:

I can convert a UnicodeString to UTF-8 by doing:

Utf8Encode('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');

and I can convert a UnicodeString to the current code-page by using:

AnsiString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');

But how do I convert a UnicodeString to an arbitrary (unspecified) code-page?

My feeling is that since everything really is an AnsiString:

Utf8String = AnsiString(65001);
RawByteString = AnsiString(65535);

I should bite the bullet, bust open the AnsiString structure, and poke the correct code-page into it:

function StringToAnsi(const s: UnicodeString; CodePage: UINT): AnsiString;
begin
   LocaleCharsFromUnicode(CodePage, ..., s, ...);

   ...

   if Length(Result) > 0 then
      PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;

Then the rest of the VCL will fall in line.

Solution

In this particular case, using RawByteString is an appropriate solution:

function WideStringToString(const Source: UnicodeString; CodePage: UINT): RawByteString;
var
  strLen: Integer;
begin
  Result := ''; // a string Result is not guaranteed to start out empty
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    SetCodePage(Result, CodePage, False);
  end;
end;

This way, the RawByteString holds the codepage, and assigning the RawByteString to any other string type, whether that be AnsiString or UTF8String or whatever, will allow the RTL to automatically convert the RawByteString data from its current codepage to the destination string's codepage (which includes conversions to UnicodeString).
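To check that the tag really travels with the string, the stored code page can be read back with StringCodePage. The sketch below is only an illustration under a few assumptions: it repeats the RawByteString routine so that it compiles on its own, declares CodePage as Cardinal to avoid pulling in the Windows unit for UINT, and uses a made-up program name and sample text.

```pascal
program CodePageTagDemo;
{$APPTYPE CONSOLE}

// The RawByteString-returning routine from the answer, repeated here
// so the sketch is self-contained.
function WideStringToString(const Source: UnicodeString; CodePage: Cardinal): RawByteString;
var
  strLen: Integer;
begin
  Result := '';
  // First call asks for the required buffer size, second call converts.
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    // Tag the bytes with the code page they are encoded in, without converting them.
    SetCodePage(Result, CodePage, False);
  end;
end;

var
  r: RawByteString;
  u: UTF8String;
begin
  // 'fôx', written with a character code so the source file encoding does not matter.
  r := WideStringToString('f'#$00F4'x', 1252);
  Writeln('code page stored in r: ', StringCodePage(r));

  // Assigning to a differently typed string lets the RTL convert,
  // based on the code page stored in r.
  u := r;
  Writeln('code page stored in u: ', StringCodePage(u));
  Writeln('byte length of r: ', Length(r), ', byte length of u: ', Length(u));
end.
```

If the tag is set correctly, the two strings should report different code pages and different byte lengths, since ô takes one byte in Windows-1252 and two in UTF-8.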

If you absolutely must return an AnsiString (which I do not recommend), you can still use SetCodePage() via a typecast:

function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
var
  strLen: Integer;
begin
  Result := ''; // a string Result is not guaranteed to start out empty
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil);
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil);
    SetCodePage(PRawByteString(@Result)^, CodePage, False);
  end;
end;

The reverse is much easier. Just use the codepage already stored in the (Ansi|RawByte)String (make sure those codepages are always accurate), since the RTL already knows how to retrieve and use the codepage for you:

function StringToWideString(const Source: AnsiString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;

function StringToWideString(const Source: RawByteString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;
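When the stored code page cannot be trusted, for example in legacy strings that were filled by WideCharToMultiByte and never tagged, the code page still has to be supplied by the caller, much like the two-argument StringToWideString in the question. One possible sketch uses the System unit's UnicodeFromLocaleChars; the function name BytesToUnicode is hypothetical, and CodePage is declared as Cardinal to avoid depending on the Windows unit:

```pascal
// Decode raw bytes using an explicitly supplied code page,
// ignoring whatever code page is stored in the string header.
function BytesToUnicode(const Source: RawByteString; CodePage: Cardinal): UnicodeString;
var
  wLen: Integer;
begin
  Result := '';
  if Length(Source) = 0 then
    Exit;
  // First call asks for the required number of WideChars.
  wLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
  if wLen > 0 then
  begin
    SetLength(Result, wLen);
    // Second call performs the actual conversion into the buffer.
    UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), PWideChar(Result), wLen);
  end;
end;
```

Taking the parameter as const RawByteString means the bytes are passed through as they are, so the caller's CodePage argument alone decides how they are interpreted.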

That being said, I would suggest dropping the helper functions altogether and just using typed strings instead. Let the RTL handle conversions for you:

type
  Win1252String = type AnsiString(1252);

var
  s: UnicodeString;
  a: Win1252String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  a := Win1252String(s);
  s := UnicodeString(a);
end;

var
  s: UnicodeString;
  u: UTF8String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  u := UTF8String(s);
  s := UnicodeString(u);
end;
