是否可以将字节解码为UTF-8，将错误转换为Rust中的转义序列? [英] Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

查看：346 发布时间：2020/7/12 18:43:32 utf-8 rust unicode-escapes

本文介绍了是否可以将字节解码为UTF-8，将错误转换为Rust中的转义序列?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在Rust中，可以通过执行以下操作从字节中获取UTF-8:

In Rust it's possible to get UTF-8 from bytes by doing this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

这行得通或行不通，但是Python可以处理错误，例如:

This either works or it doesn't, but Python has the ability to handle errors, e.g.:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');

在此示例中，参数surrogateescape将无效的utf-8序列转换为转义码，因此，代替忽略或替换无法解码的文本，将其替换为有效的.有关详细信息，请参见: Python文档.

In this example the argument surrogateescape converts invalid utf-8 sequences to escape-codes, so instead of ignoring or replacing text that can't be decoded, they are replaced with a byte literal expression, which is valid utf-8. see: Python docs for details.

Rust是否有办法从字节中获取UTF-8字符串，从而避免错误而不是完全失败?

Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?

推荐答案

是，通过

如果您需要对该过程进行更多控制，则可以使用 std::str::from_utf8 ，如其他答案所建议.但是，没有理由像建议的那样对字节进行双重验证.

If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.

一个快速被黑客入侵的例子:

A quickly hacked-up example:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

您可以将其扩展为采用值来替换坏字节，使用闭包来处理坏字节等.例如:

You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo

另请参阅:

How do I convert a Vector of bytes (u8) to a string
How to print a u8 slice as text if I don't care about the particular encoding?.

这篇关于是否可以将字节解码为UTF-8，将错误转换为Rust中的转义序列?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

是否可以将字节解码为UTF-8，将错误转换为Rust中的转义序列? [英] Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

是否可以将字节解码为UTF-8，将错误转换为Rust中的转义序列? [英] Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭