tiny-utf8: getting offset in characters / codepoints

Question

I am using tiny-utf8, which works as a drop-in replacement for std::string but adds the ability to iterate over UTF-8 characters. Everything seems fine, but sometimes my strings are inspected in their raw form (char*) by other libraries (in my case, RE2). These libraries return offsets of substrings, and those offsets refer to the raw string, which means they are in bytes.

My question is, how do I convert these to codepoint / character offsets?

I found a method which seems to allow accomplishing exactly what I need in one call:

utf8_string str = "My UTF-8 string";
str.get_num_resulting_codepoints(0, offsetInBytes);

However, it's protected. I could, of course, make it public, but there has to be a reason why it was hidden, so there should be another way.

I was also looking at utilising the raw_get method, but I am not sure if it's the right thing to do:

str.raw_get(offsetInBytes) - str.begin()

Answer

Dear Vadim,

Thank you for your question. The method get_num_resulting_codepoints was renamed to get_num_codepoints in version 2 and was additionally made private. I have drafted a new release, "2.0.2", that makes get_num_codepoints public again (along with get_num_bytes and get_num_bytes_from_start).

You can use it the same way as you did before. However, the solution with subtracting iterators is a little bit more elegant as it does exactly the same and is equally efficient. I would stick to that one :)

I hope this answer has helped you,

Jakob
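
For readers who want to see what the byte-to-codepoint conversion discussed above amounts to, here is a minimal sketch that works on a plain std::string and does not depend on tiny-utf8. The helper name byte_to_codepoint_offset is hypothetical and is not part of tiny-utf8 or RE2. It assumes the buffer holds valid UTF-8 and that the byte offset falls on a codepoint boundary, which should be the case for offsets of substring matches.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of tiny-utf8): returns the number of
// codepoints that start within the first `byte_offset` bytes of `utf8`.
// Assumes `utf8` is valid UTF-8.
inline std::size_t byte_to_codepoint_offset(const std::string& utf8,
                                            std::size_t byte_offset)
{
    assert(byte_offset <= utf8.size());
    std::size_t codepoints = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // begins a new codepoint.
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++codepoints;
    }
    return codepoints;
}
```

The loop is linear in the byte offset, since UTF-8 has no random access by codepoint. Passing the byte offset reported by the other library yields the corresponding codepoint offset.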
