Can a character span multiple runes in Go?


Question

I read this on a blog:

Even with rune slices a single character might span multiple runes, which can happen if you have characters with grave accent, for example. This complicated and ambiguous nature of "characters" is the reason why Go strings are represented as byte sequences.

Is this true? (It seems to be a blog by someone who knows Go.) I tested on my machine, and "è" is 1 rune and 2 bytes. The Go documentation also seems to say otherwise.

Have you encountered such characters (in UTF-8)? Can a character span multiple runes in Go?

Solution

Yes it can:

// 'e' followed by three combining acute accents (U+0301 = 769)
s := "e\u0301\u0301\u0301"
fmt.Println(s, []rune(s))

Output (try it on the Go Playground):

é́́ [101 769 769 769]

One character, 4 runes. It can be arbitrarily long.

Example taken from The Go Blog: Text Normalization in Go.
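
To make the byte and rune counts of that string concrete, here is a minimal self-contained sketch. The helper name counts is made up for illustration, and the string is written with \u escapes so the code points are unambiguous regardless of how the page renders them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it returns the number of bytes
// and the number of runes (code points) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// 'e' (1 byte) followed by three U+0301 combining acute accents (2 bytes each).
	s := "e\u0301\u0301\u0301"
	b, r := counts(s)
	fmt.Println(b, r) // prints "7 4"
}
```

So what a reader sees as one character occupies 7 bytes and 4 runes here.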

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character. The definition of a character may vary depending on the application. For normalization we will define it as a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune, followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents). The normalization algorithm processes one character at a time.

A character can be followed by any number of modifiers (modifiers can be repeated and stacked):

Theoretically, there is no bound to the number of runes that can make up a Unicode character. In fact, there are no restrictions on the number of modifiers that can follow a character and a modifier may be repeated, or stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a perfectly valid 4-rune character according to the standard.
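
The "starter followed by non-starters" idea can be roughly sketched with the standard library alone. This is only an approximation under a stated assumption: it treats every rune in Unicode category M (marks) as a non-starter, which is not exactly the definition used by the normalization algorithm. The function name splitChars is invented for this example; for real work, the golang.org/x/text/unicode/norm package from the quoted blog post is the more appropriate tool.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitChars is a hypothetical helper that groups the runes of s into
// "characters": a base rune followed by any combining marks.
// Assumption: category M (unicode.M) approximates "non-starter".
func splitChars(s string) [][]rune {
	var out [][]rune
	for _, r := range s {
		if unicode.Is(unicode.M, r) && len(out) > 0 {
			// Combining mark: attach it to the previous character.
			out[len(out)-1] = append(out[len(out)-1], r)
			continue
		}
		// Anything else starts a new character.
		out = append(out, []rune{r})
	}
	return out
}

func main() {
	// 'a', then 'e' with three combining acutes, then 'z'.
	for _, c := range splitChars("ae\u0301\u0301\u0301z") {
		fmt.Printf("%q is made of %d rune(s)\n", string(c), len(c))
	}
}
```

With this input the sketch yields three characters made of 1, 4, and 1 runes respectively.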

Also see: Combining character.

Edit: "Doesn't this kill the 'concept of runes'?"

Answer: No. Multi-rune characters are a concept of the Unicode standard, not of runes. A rune is not a character; it is an integer value identifying a Unicode code point. A character may consist of a single code point, in which case 1 character is 1 rune. Most everyday use of runes fits this case, so in practice this rarely causes headaches.
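
This also explains the observation in the question. A precomposed accented letter is a single code point, while the visually identical decomposed form is two. A small sketch (runeLen is a hypothetical helper name):

```go
package main

import "fmt"

// runeLen is a hypothetical helper returning the number of runes in s.
func runeLen(s string) int {
	return len([]rune(s))
}

func main() {
	precomposed := "\u00e9" // é as the single code point U+00E9
	decomposed := "e\u0301" // 'e' followed by combining acute U+0301

	fmt.Println(len(precomposed), runeLen(precomposed)) // prints "2 1"
	fmt.Println(len(decomposed), runeLen(decomposed))   // prints "3 2"

	// The two strings render alike but are different byte sequences.
	fmt.Println(precomposed == decomposed) // prints "false"
}
```

Both strings display as the same character, yet they differ in bytes and runes, so comparing such text reliably requires normalizing both sides first.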
