是否可以将字节解码为UTF-8,将错误转换为Rust中的转义序列? [英] Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?
问题描述
在Rust中,可以通过执行以下操作从字节中获取UTF-8:
In Rust it's possible to get UTF-8 from bytes by doing this:
if let Ok(s) = str::from_utf8(some_u8_slice) {
println!("example {}", s);
}
这行得通或行不通,但是Python可以处理错误,例如:
This either works or it doesn't, but Python has the ability to handle errors, e.g.:
s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');
在此示例中,参数surrogateescape
将无效的utf-8序列转换为转义码,因此,代替忽略或替换无法解码的文本,将其替换为有效的
In this example the argument surrogateescape
converts invalid utf-8 sequences to escape-codes, so instead of ignoring or replacing text that can't be decoded, they are replaced with a byte literal expression, which is valid utf-8
. see: Python docs for details.
Rust是否有办法从字节中获取UTF-8字符串,从而避免错误而不是完全失败?
Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?
推荐答案
是,通过
如果您需要对该过程进行更多控制,则可以使用 If you need more control over the process, you can use 一个快速被黑客入侵的例子: A quickly hacked-up example: 您可以将其扩展为采用值来替换坏字节,使用闭包来处理坏字节等.例如: You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:
另请参阅: 这篇关于是否可以将字节解码为UTF-8,将错误转换为Rust中的转义序列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!std::str::from_utf8
,如其他答案所建议.但是,没有理由像建议的那样对字节进行双重验证.std::str::from_utf8
, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.use std::str;
fn example(mut bytes: &[u8]) -> String {
let mut output = String::new();
loop {
match str::from_utf8(bytes) {
Ok(s) => {
// The entire rest of the string was valid UTF-8, we are done
output.push_str(s);
return output;
}
Err(e) => {
let (good, bad) = bytes.split_at(e.valid_up_to());
if !good.is_empty() {
let s = unsafe {
// This is safe because we have already validated this
// UTF-8 data via the call to `str::from_utf8`; there's
// no need to check it a second time
str::from_utf8_unchecked(good)
};
output.push_str(s);
}
if bad.is_empty() {
// No more data left
return output;
}
// Do whatever type of recovery you need to here
output.push_str("<badbyte>");
// Skip the bad byte and try again
bytes = &bad[1..];
}
}
}
}
fn main() {
let r = example(&[104, 101, 0xFF, 108, 111]);
println!("{}", r); // he<badbyte>lo
}
fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
// ...
handler(&mut output, bad);
// ...
}
let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
use std::fmt::Write;
write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo