Remove invalid UTF-8 characters from a string (Go lang)

Views: 1074 Published: 2018/5/2 17:56:17 Tags: json unicode go

Problem description

I get this on json.Marshal of a list of strings:

json: invalid UTF-8 in string: "...ole\xc5\"

The reason is obvious, but how can I delete or replace such strings in Go? I've been reading the docs for the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.

In Python, for example, there are methods for this: invalid characters can be deleted, replaced by a specified character, or handled by a strict setting that raises an exception on invalid characters. How can I do the equivalent thing in Go?

UPDATE: I meant the reason for getting an exception (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.

(How the illegal byte sequence got into that string is not important; it is the usual story: bugs, file corruption, other programs that do not conform to Unicode, etc.)

Solution

For example,

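package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // 0xc5 starts a two-byte sequence, but 'z' is not a continuation byte.
    s := "a\xc5z"
    fmt.Printf("%q\n", s)
    if !utf8.ValidString(s) {
        v := make([]rune, 0, len(s))
        for i, r := range s {
            // Ranging over a string yields utf8.RuneError (U+FFFD) for each invalid byte.
            if r == utf8.RuneError {
                // A size of 1 means an invalid byte. A literal U+FFFD that is
                // already in s decodes with size 3 and is kept.
                _, size := utf8.DecodeRuneInString(s[i:])
                if size == 1 {
                    continue // drop the invalid byte
                }
            }
            v = append(v, r)
        }
        s = string(v)
    }
    fmt.Printf("%q\n", s)
}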

Output:

"a\xc5z"
"az"

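On Go 1.13 or later, the standard library function strings.ToValidUTF8 appears to cover the same ground as the loop above. The following is a minimal sketch under that version assumption; the helper names stripInvalidUTF8 and replaceInvalidUTF8 are illustrative and do not come from the original answer.

```go
package main

import (
	"fmt"
	"strings"
)

// stripInvalidUTF8 is a hypothetical helper (not from the original answer).
// It removes every invalid UTF-8 byte sequence from s.
// strings.ToValidUTF8 requires Go 1.13 or later.
func stripInvalidUTF8(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalidUTF8 is also hypothetical. It substitutes U+FFFD
// (REPLACEMENT CHARACTER) for each run of invalid bytes instead of dropping it.
func replaceInvalidUTF8(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", stripInvalidUTF8(s)) // prints "az"
	fmt.Printf("%q\n", replaceInvalidUTF8(s))
}
```

A U+FFFD that is already validly encoded in the input is left alone by both helpers, which matches the behavior of the size check in the loop above.
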
Unicode Standard

FAQ - UTF-8, UTF-16, UTF-32 & BOM

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
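
The three recovery options named in the FAQ (signal an error, filter the byte out, or mark it with U+FFFD) roughly mirror the strict, ignore, and replace modes the question mentions for Python. The sketch below shows one way they might be written using only unicode/utf8. The names Policy, Strict, Ignore, Replace, and sanitize are assumptions made for illustration and are not part of any Go package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Policy is a hypothetical type naming the three recovery options from the FAQ.
type Policy int

const (
	Strict  Policy = iota // signal an error at the first invalid byte
	Ignore                // filter the invalid byte out
	Replace               // represent the invalid byte with U+FFFD
)

// sanitize is a hypothetical helper that walks s one decoded rune at a time
// and applies the chosen policy to every invalid byte it meets.
func sanitize(s string, p Policy) (string, error) {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			// Invalid byte: a validly encoded U+FFFD would have size 3.
			switch p {
			case Strict:
				return "", fmt.Errorf("invalid UTF-8 at byte offset %d", i)
			case Replace:
				b.WriteRune(utf8.RuneError)
			}
			// Ignore writes nothing. Either way, continue at the next byte.
			i++
			continue
		}
		b.WriteString(s[i : i+size])
		i += size
	}
	return b.String(), nil
}

func main() {
	s := "a\xc5z" // <110xxxxx 0xxxxxxx>: the illegal pattern from the FAQ
	for _, p := range []Policy{Strict, Ignore, Replace} {
		out, err := sanitize(s, p)
		fmt.Printf("%q %v\n", out, err)
	}
}
```

In this sketch, processing resumes at the byte immediately after the offending one, which is the behavior the FAQ describes for the filter and replace cases.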
