Remove invalid UTF-8 characters from a string


Question

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete/replace such characters in Go? I've been reading the docs on the unicode and unicode/utf8 packages and there seems to be no obvious/quick way to do it.

In Python, for example, there are methods for this where invalid characters can be deleted, replaced by a specified character, or handled with a strict setting that raises an exception on invalid characters. How can I do the equivalent in Go?

UPDATE: I meant the reason for getting an exception (panic?): an illegal char in what json.Marshal expects to be a valid UTF-8 string.

(how the illegal byte sequence got into that string is not important, the usual way - bugs, file corruption, other programs that do not conform to unicode, etc)

Recommended answer

For example,

package main

import (
    "fmt"
    "unicode/utf8"
)

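func main() {
    s := "a\xc5z" // 0xC5 is a lead byte with no continuation byte after it
    fmt.Printf("%q\n", s)
    if !utf8.ValidString(s) {
        v := make([]rune, 0, len(s))
        for i, r := range s {
            // range yields utf8.RuneError for each invalid byte, but also
            // for a correctly encoded U+FFFD already present in the string.
            if r == utf8.RuneError {
                // A decoded width of 1 means a genuinely invalid byte: drop it.
                _, size := utf8.DecodeRuneInString(s[i:])
                if size == 1 {
                    continue
                }
            }
            v = append(v, r)
        }
        s = string(v)
    }
    fmt.Printf("%q\n", s)
}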

Output:

"a\xc5z"
"az"

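On Go 1.13 and later, the standard library function strings.ToValidUTF8 appears to cover the same ground in a single call. The following is a minimal sketch under that version assumption; the helper names dropInvalid and replaceInvalid are made up for illustration and are not part of the original answer.

```go
package main

import (
	"fmt"
	"strings"
)

// dropInvalid removes every run of invalid UTF-8 bytes from s.
// (Hypothetical helper name, used only for this sketch.)
func dropInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes U+FFFD for every run of invalid bytes.
// (Hypothetical helper name, used only for this sketch.)
func replaceInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", dropInvalid(s))
	fmt.Printf("%q\n", replaceInvalid(s))
}
```

Note that strings.ToValidUTF8 replaces a whole run of adjacent invalid bytes with one copy of the replacement string, so two bad bytes in a row yield a single U+FFFD rather than two.
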
The Unicode Standard

FAQ - UTF-8, UTF-16, UTF-32 & BOM

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
