tiny-utf8: getting offset in characters / codepoints

Views: 35 | Posted: 2021/9/15 19:43:34 | Tags: c++ unicode utf-8


Problem description

I am using tiny-utf8, which works as a drop-in replacement for std::string but adds the ability to iterate over UTF-8 characters. Everything seems fine, except that my strings are sometimes inspected in their raw form (char*) by other libraries (in my case, RE2). Those libraries return substring offsets relative to the raw string, which means the offsets are in bytes.

My question is: how do I convert these byte offsets to codepoint / character offsets?

I found a method that seems to do exactly what I need in one call:

utf8_string str = "我的 UTF-8 字符串";
str.get_num_resulting_codepoints(0, offsetInBytes);

However, it is protected. I could make it public, of course, but there must be a reason it was hidden, so there should be another way.

I also looked at using the raw_get method, but I am not sure whether that is the right approach:

str.raw_get(offsetInBytes) - str.begin()
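
For comparison, the conversion can also be sketched without touching tiny-utf8 internals at all, by working on the raw bytes directly. The helper below, byte_to_codepoint_offset, is a hypothetical name and not part of tiny-utf8 or RE2. It assumes the input is valid UTF-8 and relies on the fact that every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point, so counting those bytes in front of a byte offset gives the codepoint offset.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of tiny-utf8 or RE2).
// Assumes `utf8` holds valid UTF-8. Returns the number of code points
// that start within the first `byte_offset` bytes of the string.
std::size_t byte_to_codepoint_offset(const std::string& utf8,
                                     std::size_t byte_offset) {
    // Clamp offsets that point past the end of the string.
    if (byte_offset > utf8.size()) {
        byte_offset = utf8.size();
    }

    std::size_t codepoints = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // begins a new code point.
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++codepoints;
        }
    }
    return codepoints;
}
```

The raw bytes passed in would be the same buffer the other library inspected, together with the byte offset it reported. The scan is linear in the offset, so for many lookups on one long string it may be worth converting offsets in ascending order and resuming from the previous position. If a byte offset falls in the middle of a multi-byte sequence, this sketch counts the partially covered character as well.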
