Why is indexing of UTF8 strings discouraged in Julia?


Question

The introductory guide to Julia, Learn Julia in Y Minutes, discourages users from indexing UTF8 strings:

# Some strings can be indexed like an array of characters
"This is a string"[1] # => 'T' # Julia indexes from 1
# However, this is will not work well for UTF8 strings,
# so iterating over strings is recommended (map, for loops, etc).

Why is indexing such strings discouraged? What specifically about the structure of this alternate string type makes indexing error-prone? Is this a Julia-specific pitfall, or does it extend to all languages with UTF8 string support?

Recommended answer

Because in UTF8 a character is not always encoded in a single byte.

Take for example the German word böse (evil). The bytes of this string in UTF8 encoding are:

0x62 0xC3 0xB6 0x73 0x65
b    ö         s    e

As you can see, the umlaut ö requires 2 bytes.
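The byte layout above can be checked from Julia itself. This is a minimal sketch, assuming Julia 1.x (where `codeunits` and `ncodeunits` are available in Base); the variable names are illustrative only.

```julia
s = "böse"

# Raw UTF-8 code units (bytes) that make up the string
bytes = collect(codeunits(s))

# Number of bytes versus number of characters
n_bytes = ncodeunits(s)   # ö contributes two bytes
n_chars = length(s)       # ö counts as one character

println(bytes)
println(n_bytes, " bytes, ", n_chars, " characters")
```

The mismatch between the two counts is what makes integer indexing surprising.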

Now if you index this UTF8-encoded string directly, "böse"[4] gives you s and not e, because the index counts bytes rather than characters.
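To make the indexing behaviour concrete, here is a hedged sketch, again assuming Julia 1.x. It uses `eachindex` and `nextind` from Base to find valid indices; in Julia 1.x, indexing into the middle of a multi-byte character raises a `StringIndexError` (older releases such as 0.4 reported this differently).

```julia
s = "böse"

# Only byte positions where a character starts are valid indices
valid = collect(eachindex(s))

# Byte index 4 is where 's' starts
at_byte_4 = s[4]

# Byte index 3 is the second byte of ö, so indexing there fails
err = try
    s[3]
catch e
    e
end

# nextind(s, 0, n) returns the byte index of the n-th character
fourth_char = s[nextind(s, 0, 4)]

println(valid)
println(at_byte_4, " ", fourth_char)
println(typeof(err))
```

Stepping with `nextind` keeps the code on valid character boundaries without copying the string.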

However, you can use the string as an iterable object in Julia:

julia> for c in "böse"
           println(c)
       end
b
ö
s
e
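If positional access by character is really needed, one option (a sketch under the assumption of Julia 1.x, not the only approach) is to collect the string into a `Vector{Char}`, at the cost of an extra allocation. Iterating with `eachindex` is an alternative that yields only valid byte indices.

```julia
s = "böse"

# Collecting gives an array indexed by character position
chars = collect(s)
fourth = chars[4]

# eachindex skips byte positions that fall inside a character
visited = [(i, s[i]) for i in eachindex(s)]

println(fourth)
println(visited)
```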

And since you asked: no, the problems with direct byte indexing of UTF8 strings are not specific to Julia.

Recommended reading:
http://docs.julialang.org/en/release-0.4/manual/strings/#unicode-and-utf-8
