Convert char pos of UnicodeString to byte pos in a UTF-8 string

Question

I use Scintilla with its encoding set to UTF-8 (as far as I understand, this is the only way to make it work with Unicode characters). With this setup, whenever Scintilla talks about positions in the text, it means byte positions.

The problem is that I use UnicodeString in the rest of my program, so when I need to select a particular range in the Scintilla editor, I have to convert a char position in the UnicodeString to a byte position in the corresponding UTF-8 string. How can I do that easily? Thanks.

PS: when I found ByteToCharIndex I thought it was what I needed. However, according to its documentation and my own tests, it only works if the system uses a multi-byte character system (MBCS).

Recommended answer

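You should parse the UTF-8 string yourself, following the UTF-8 encoding rules. A byte starts a new character unless its top two bits are 10 (a continuation byte), so counting the non-continuation bytes up to a given byte gives the character index. I have written a quick UTF-8 analog of ByteToCharIndex and tested it on a Cyrillic string: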

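function UTF8PosToCharIndex(const S: UTF8String; Index: Integer): Integer;
var
  I: Integer;
  P: PAnsiChar;

begin
  // Returns the 1-based index of the character that contains byte Index,
  // or 0 if Index is out of range.
  Result:= 0;
  if (Index <= 0) or (Index > Length(S)) then Exit;
  I:= 1;
  P:= PAnsiChar(S);
  while I <= Index do begin
    // Continuation bytes look like 10xxxxxx; every other byte starts a character.
    if (Ord(P^) and $C0) <> $80 then Inc(Result);
    Inc(I);
    Inc(P);
  end;
end;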

const TestStr: UTF8String = 'abФЫВА';

procedure TForm1.Button2Click(Sender: TObject);
begin
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 1))); // a = 1
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 2))); // b = 2
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 3))); // Ф = 3
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 5))); // Ы = 4
  ShowMessage(IntToStr(UTF8PosToCharIndex(TestStr, 7))); // В = 5
end;

The reverse function is no problem either:

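function CharIndexToUTF8Pos(const S: UTF8String; Index: Integer): Integer;
var
  P: PAnsiChar;

begin
  // Returns the 1-based position of the first byte of character Index,
  // or 0 if S has no such character.
  Result:= 0;
  P:= PAnsiChar(S);
  while (Result < Length(S)) and (Index > 0) do begin
    Inc(Result);
    // Count down only on bytes that start a character (not 10xxxxxx).
    if (Ord(P^) and $C0) <> $80 then Dec(Index);
    Inc(P);
  end;
  if Index <> 0 then Result:= 0;  // char index not found
end;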
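
The two functions above count UTF-8 code points, whereas a Delphi UnicodeString is indexed in UTF-16 code units, so the two indices differ for characters outside the Basic Multilingual Plane (which take two code units in UTF-16 and four bytes in UTF-8). The following is a minimal sketch of going straight from a UnicodeString char position to a byte offset without building the UTF-8 string. The name CharPosToUTF8ByteOffset is hypothetical and not part of the original answer; the sketch assumes 1-based string indexing and counts an unpaired surrogate as three bytes.

```pascal
// Hypothetical helper (not from the original answer).
// Returns the zero-based UTF-8 byte offset of the UTF-16 code unit at
// the 1-based position CharPos in S. Passing Length(S) + 1 gives the
// total UTF-8 byte length of S.
function CharPosToUTF8ByteOffset(const S: UnicodeString; CharPos: Integer): Integer;
var
  I: Integer;
  C: Word;
begin
  Result := 0;
  I := 1;
  while (I < CharPos) and (I <= Length(S)) do
  begin
    C := Ord(S[I]);
    if C < $80 then
      Inc(Result)                 // U+0000..U+007F: 1 byte
    else if C < $800 then
      Inc(Result, 2)              // U+0080..U+07FF: 2 bytes
    else if (C >= $D800) and (C <= $DBFF) and (I < Length(S)) and
            (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
    begin
      Inc(Result, 4);             // surrogate pair: one 4-byte sequence
      Inc(I);                     // skip the low surrogate
    end
    else
      Inc(Result, 3);             // rest of the BMP (assumption: also unpaired surrogates)
    Inc(I);
  end;
end;
```

For 'abФЫВА' and CharPos = 4 (the letter Ы), this returns 4, which is the zero-based form of byte position 5 in the test above. Scintilla positions are zero-based byte offsets, so the offsets computed for the start and the end of a range could be passed to a message such as SCI_SETSEL.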
