From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary

Views: 54 Published: 2022/3/24 12:40:59 Tags: swift string unicode swift4


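Question

My goal: given an arbitrary UTF-16 position in a String, find the String.Index that represents the Character (i.e. the extended grapheme cluster) to which the specified UTF-16 code unit belongs.

Example:

(I put the code in a Gist for easy copying and pasting.)

This is my test string: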

let str = "👨🏾‍🚒"

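(Note: to see the string as a single character, you need to read this on a reasonably recent OS/browser combination that can handle the new profession emoji with skin tones introduced in Unicode 9.)

It is a single Character (grapheme cluster) made up of four Unicode scalars, or seven UTF-16 code units: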

print(str.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f468", "0x1f3fe", "0x200d", "0x1f692"]
print(str.utf16.map { "0x\(String($0, radix: 16))" })
// → ["0xd83d", "0xdc68", "0xd83c", "0xdffe", "0x200d", "0xd83d", "0xde92"]
print(str.utf16.count)
// → 7
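The jump from four scalars to seven code units comes from surrogate pairs: every scalar above U+FFFF takes two UTF-16 code units, while the zero-width joiner (U+200D) takes one. The following is only a small sketch to make that arithmetic visible; the name `widths` is introduced here for illustration, and the string is spelled with escapes so the scalars are explicit.

```swift
// Same test string as above, written with Unicode escapes.
let str = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"

// `widths` (hypothetical name) holds the number of UTF-16 code units
// each scalar needs: 2 for scalars above U+FFFF, 1 otherwise.
let widths = str.unicodeScalars.map { String($0).utf16.count }
print(widths)
// → [2, 2, 1, 2]
print(widths.reduce(0, +))
// → 7
```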

Given an arbitrary UTF-16 offset (say, 2), I can create a corresponding String.Index:

let utf16Offset = 2
let utf16Index = String.Index(encodedOffset: utf16Offset)
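A note for readers on newer toolchains: `encodedOffset` was deprecated in Swift 5, and the same index can be built with `String.Index(utf16Offset:in:)`. The sketch below assumes Swift 5 or later, which is not the environment used in this question; the name `modernIndex` is made up for the example.

```swift
// Assumption: Swift 5 or later (the question itself targets Swift 4.0).
let str = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"
let utf16Offset = 2

// `modernIndex` (hypothetical name) addresses the third UTF-16 code unit.
let modernIndex = String.Index(utf16Offset: utf16Offset, in: str)

// Round-trip back to an integer offset.
print(modernIndex.utf16Offset(in: str))
// → 2
```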

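I can subscript the string with this index, but if the index does not fall on a Character boundary, the Character returned by the subscript may not cover the entire grapheme cluster: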

let char = str[utf16Index]
print(char)
// → 🏾‍🚒
print(char.unicodeScalars.map { "0x\(String($0.value, radix: 16))" })
// → ["0x1f3fe", "0x200d", "0x1f692"]

Or the subscript operation may even trap (I'm not sure whether this is the intended behavior):

let trappingIndex = String.Index(encodedOffset: 1)
str[trappingIndex]
// fatal error: Can't form a Character from a String containing more than one extended grapheme cluster

You can test whether an index falls on a Character boundary:

extension String.Index {
    func isOnCharacterBoundary(in str: String) -> Bool {
        return String.Index(self, within: str) != nil
    }
}

trappingIndex.isOnCharacterBoundary(in: str)
// → false (as expected)
utf16Index.isOnCharacterBoundary(in: str)
// → true (WTF!)

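The problem:

I think the problem is that the last expression returns true. The documentation for String.Index.init(_:within:) says:

If the index passed as sourcePosition represents the start of an extended grapheme cluster (the element type of a string), then the initializer succeeds.

Here, utf16Index does not represent the start of an extended grapheme cluster: the grapheme cluster starts at offset 0, not at offset 2. Yet the initializer succeeds.

As a result, all my attempts to find the start of the grapheme cluster by repeatedly decrementing the index's encodedOffset and testing isOnCharacterBoundary fail.

Am I overlooking something? Is there another way to test whether an index falls on the start of a Character? Is this a bug in Swift?

My environment: Swift 4.0 / Xcode 9.0 on macOS 10.13.

Update: see the interesting Twitter thread about this question.

Update: I reported the behavior of String.Index.init?(_:within:) in Swift 4.0 as a bug: SR-5992.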
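For comparison, a boundary test can avoid String.Index.init(_:within:) altogether by comparing the offset against the indices the string itself vends, since `indices` only ever yields Character starts. This is a minimal sketch, assuming Swift 5 or later for `utf16Offset(in:)`; the helper name `isCharacterBoundary(utf16Offset:)` is invented for the example, and the linear scan is chosen for clarity rather than speed.

```swift
extension String {
    /// Hypothetical helper: true if some Character of the string begins
    /// exactly at `utf16Offset`. The end of the string does not count.
    func isCharacterBoundary(utf16Offset: Int) -> Bool {
        // `indices` walks the string one Character at a time.
        return indices.contains { $0.utf16Offset(in: self) == utf16Offset }
    }
}

let str = "\u{1F468}\u{1F3FE}\u{200D}\u{1F692}"
print(str.isCharacterBoundary(utf16Offset: 0))
// → true
print(str.isCharacterBoundary(utf16Offset: 2))
// → false
```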

Recommended answer

One possible solution, using the rangeOfComposedCharacterSequence(at:) method:

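import Foundation  // rangeOfComposedCharacterSequence(at:) comes from Foundation

extension String {
    /// Returns the index of the Character that contains the UTF-16 code unit
    /// at `utf16Offset`, or nil if the offset is out of range.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        // Expand to the enclosing composed character sequence and take its start.
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}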

Example:

let str = "a👨🏾‍🚒b🇩🇪c😀d👩‍👩‍👧‍👧e"
for utf16Offset in 0..<str.utf16.count {
    if let idx = str.index(utf16Offset: utf16Offset) {
        print(utf16Offset, str[idx])
    }
}

Output:

0 a
1 👨🏾‍🚒
2 👨🏾‍🚒
3 👨🏾‍🚒
4 👨🏾‍🚒
5 👨🏾‍🚒
6 👨🏾‍🚒
7 👨🏾‍🚒
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 😀
15 😀
16 d
17 👩‍👩‍👧‍👧
18 👩‍👩‍👧‍👧
19 👩‍👩‍👧‍👧
20 👩‍👩‍👧‍👧
21 👩‍👩‍👧‍👧
22 👩‍👩‍👧‍👧
23 👩‍👩‍👧‍👧
24 👩‍👩‍👧‍👧
25 👩‍👩‍👧‍👧
26 👩‍👩‍👧‍👧
27 👩‍👩‍👧‍👧
28 e
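The same mapping can also be computed without Foundation by walking the string one Character at a time and counting how many UTF-16 code units have been consumed. The sketch below is an alternative to the answer above, not a restatement of it; the helper name `characterIndex(containingUTF16Offset:)` is hypothetical, and it assumes a Swift version whose grapheme breaking treats the emoji sequence as a single Character.

```swift
extension String {
    /// Hypothetical helper: index of the Character that contains the UTF-16
    /// code unit at `utf16Offset`, or nil if the offset is out of range.
    func characterIndex(containingUTF16Offset utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 else { return nil }
        var consumed = 0  // UTF-16 code units covered so far
        for idx in indices {
            consumed += String(self[idx]).utf16.count
            if utf16Offset < consumed { return idx }
        }
        return nil  // offset is at or past the end of the string
    }
}

let sample = "a\u{1F468}\u{1F3FE}\u{200D}\u{1F692}b"
for offset in [0, 1, 7, 8] {
    if let idx = sample.characterIndex(containingUTF16Offset: offset) {
        print(offset, sample[idx])
    }
}
```

Offsets 1 through 7 all resolve to the index of the firefighter emoji, matching the table above. Each lookup is linear in the length of the string, so for many lookups on one string it may be worth precomputing the boundaries once.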
