Convert a char position in a UnicodeString to a byte position in a UTF-8 string
Question
I use Scintilla and set its encoding to UTF-8 (as far as I understand, this is the only way to make it work with Unicode characters). With this setup, when Scintilla talks about positions in the text, it means byte positions.
The problem is that I use UnicodeString in the rest of my program. When I need to select a particular range in the Scintilla editor, I have to convert a char position in the UnicodeString to a byte position in the UTF-8 string that corresponds to it. How can I do that easily? Thanks.
PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own testing, it only works if the system uses a multi-byte character system (MBCS).
Recommended answer
You should parse the UTF-8 string yourself, following the UTF-8 encoding description. I have written a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string:
function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;
begin
  Result := 0;
  // Index is a 1-based byte position; 0 is returned if it is out of range.
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I := 1;
  P := PAnsiChar(S);
  while I <= Index do begin
    // Count every byte that is not a continuation byte (10xxxxxx).
    if Ord(P^) and $C0 <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;
const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;
The reverse function is no problem either. It returns the 1-based position of the first byte of the requested character:
function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;
begin
  Result := 0;
  P := PAnsiChar(S);
  // Advance byte by byte until Index lead bytes have been seen.
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    if Ord(P^) and $C0 <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result := 0; // char index not found
end;
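Two details may matter when feeding these results to Scintilla. First, Scintilla positions are 0-based byte offsets, while the functions above are 1-based. Second, a Delphi UnicodeString is UTF-16, so a character outside the Basic Multilingual Plane occupies two string indexes (a surrogate pair) but counts as a single character in the functions above. The following is a minimal sketch that maps a UTF-16 index directly to a 0-based UTF-8 byte offset without building the UTF-8 string. The name Utf16IndexToUtf8Offset is hypothetical and not part of the original answer, and the sketch assumes Delphi 2009 or later and a well-formed UTF-16 string.

```pascal
program Utf16ToUtf8Offset;
{$APPTYPE CONSOLE}

// Hypothetical helper, not part of the original answer.
// Returns the 0-based UTF-8 byte offset of the character that starts at the
// 1-based UTF-16 index Index in S. Assumes S is well-formed UTF-16
// (every high surrogate is followed by a low surrogate).
function Utf16IndexToUtf8Offset(const S: UnicodeString; Index: Integer): Integer;
var
  I: Integer;
  C: Word;
begin
  Result := 0;
  I := 1;
  // Sum the UTF-8 lengths of all characters that precede Index.
  while (I < Index) and (I <= Length(S)) do
  begin
    C := Ord(S[I]);
    if C < $80 then
      Inc(Result)                 // U+0000..U+007F: 1 byte
    else if C < $800 then
      Inc(Result, 2)              // U+0080..U+07FF: 2 bytes
    else if (C >= $D800) and (C <= $DBFF) and (I < Length(S)) then
    begin
      Inc(Result, 4);             // surrogate pair: 4 bytes in UTF-8
      Inc(I);                     // skip the low surrogate
    end
    else
      Inc(Result, 3);             // rest of the BMP: 3 bytes
    Inc(I);
  end;
end;

var
  S: UnicodeString;
begin
  S := 'ab'#$0424#$042B;                    // 'abФЫ'
  Writeln(Utf16IndexToUtf8Offset(S, 3));   // Ф starts at byte offset 2
  Writeln(Utf16IndexToUtf8Offset(S, 4));   // Ы starts at byte offset 4
  S := #$D83D#$DE00'x';                     // U+1F600 followed by 'x'
  Writeln(Utf16IndexToUtf8Offset(S, 3));   // 'x' starts at byte offset 4
end.
```

If Index is greater than Length(S), the function returns the total UTF-8 byte length, which is convenient for the end of a selection range.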