How to detect when bytes can't be converted to string in Go?

Question

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?

Recommended answer

You can, as Tim Cooper noted, test UTF-8 validity with utf8.Valid.
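As a minimal sketch of that check: utf8.Valid takes a []byte and utf8.ValidString takes a string, both from the standard unicode/utf8 package. The helper name isText is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isText is a hypothetical helper name; it simply wraps utf8.Valid.
func isText(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	fmt.Println(isText([]byte("héllo")))      // true: "é" is a valid two-byte sequence
	fmt.Println(isText([]byte{0xff}))         // false: 0xff never occurs in UTF-8
	fmt.Println(isText([]byte{0xe2, 0x82}))   // false: truncated three-byte sequence
	fmt.Println(utf8.ValidString("\xc3\x28")) // false: 0x28 is not a continuation byte
}
```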

But! You might be thinking that converting non-UTF-8 bytes to a Go string is impossible. In fact, "In Go, a string is in effect a read-only slice of bytes"; it can contain bytes that aren't valid UTF-8 which you can print, access via indexing, pass to WriteString methods, or even round-trip back to a []byte (to Write, say).
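A small sketch of that point, assuming nothing beyond the standard library: the conversion in each direction copies the bytes unchanged, so an invalid byte survives the round trip. The helper name roundTrip is invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip is a hypothetical helper: []byte -> string -> []byte.
func roundTrip(b []byte) []byte {
	s := string(b) // copies the bytes as-is; no validation, no replacement
	return []byte(s)
}

func main() {
	in := []byte{'a', 0xff, 'b'}
	s := string(in)

	fmt.Println(len(s))                         // 3: length is counted in bytes
	fmt.Printf("%#x\n", s[1])                   // 0xff: indexing yields the raw byte
	fmt.Println(bytes.Equal(in, roundTrip(in))) // true: nothing was altered
}
```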

There are two places in the language that Go does do UTF-8 decoding of strings for you.

  • when you do for i, r := range s the r is a Unicode code point as a value of type rune
  • when you do the conversion []rune(s), Go decodes the whole string to runes.
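To make the two cases concrete, here is a short sketch that mixes valid and invalid bytes. It also prints the index that range yields, which is a byte offset rather than a rune count. The helper runesOf is a made-up name.

```go
package main

import "fmt"

// runesOf is a hypothetical helper that collects the runes produced by range.
func runesOf(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "é\xffz" // "é" is two bytes, 0xff is invalid, "z" is one byte

	for i, r := range s {
		fmt.Printf("byte offset %d: %U\n", i, r)
	}
	// byte offset 0: U+00E9
	// byte offset 2: U+FFFD
	// byte offset 3: U+007A

	fmt.Println(len(s), len([]rune(s)), len(runesOf(s))) // 4 3 3
}
```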

(Note that rune is an alias for int32, not a completely different type.)

In both these instances invalid UTF-8 is replaced with U+FFFD, the replacement character reserved for uses like this. More is in the spec sections on for statements and on conversions between strings and other types (https://golang.org/ref/spec#Conversions_to_and_from_a_string_type). These conversions never crash, so you only need to actively check for UTF-8 validity if it's relevant to your application, like if you can't accept the U+FFFD replacement and need to return an error on mis-encoded input.
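If your application does need to reject mis-encoded input, one possible shape is a small wrapper around utf8.Valid. This is only a sketch: the names toStringStrict and errInvalidUTF8 are invented here and are not part of any library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 and toStringStrict are hypothetical names for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toStringStrict converts b to a string, but refuses invalid UTF-8
// instead of letting later decoding substitute U+FFFD.
func toStringStrict(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	for _, in := range [][]byte{[]byte("ok"), {0xff}} {
		s, err := toStringStrict(in)
		if err != nil {
			fmt.Printf("% x: %v\n", in, err)
			continue
		}
		fmt.Printf("%q\n", s)
	}
}
```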

Since that behavior's baked into the language, you can expect it from libraries, too. U+FFFD is utf8.RuneError and returned by functions in utf8.
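One detail worth sketching: utf8.DecodeRune reports a decoding failure as utf8.RuneError with a width of 1, whereas a correctly encoded U+FFFD in the input decodes with a width of 3. That difference lets you locate the bad byte. The helper firstInvalid below is a hypothetical name, not a library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid is a hypothetical helper: it returns the byte offset of the
// first invalid UTF-8 sequence in b, or -1 if b is entirely valid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// A genuine U+FFFD decodes with size 3; a decoding failure
		// is reported as (utf8.RuneError, 1).
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalid([]byte("abc")))     // -1
	fmt.Println(firstInvalid([]byte("ab\xffc"))) // 2
	fmt.Println(firstInvalid([]byte("\uFFFD")))  // -1: a real replacement character is valid
}
```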

Here's a sample program showing what Go does with a []byte holding invalid UTF-8:

package main

import "fmt"

func main() {
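    a := []byte{0xff} // 0xff is never valid in UTF-8
    s := string(a)    // the conversion keeps the byte as-is
    fmt.Println(s)    // writes the raw byte; rendering depends on the environment
    for _, r := range s {
        fmt.Println(r) // range decodes, so the invalid byte yields 65533 (U+FFFD)
    }
    rs := []rune(s) // converting to []rune decodes the same way
    fmt.Println(rs)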
}

Output will look different in different environments, but in the Playground it looks like:

�
65533
[65533]
