Reading a large file line by line and avoiding UTF-8 errors in Rust

Question

I have a really large file that "should" consist of JSON strings. However, when I use the following code, I get the error "stream did not contain valid UTF-8".

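use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let file = File::open("foo.txt")?;
    let reader = BufReader::new(file);

    for line in reader.lines() {
        // `lines` yields an Err for any line that is not valid UTF-8.
        println!("{}", line?);
    }

    Ok(())
}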

The usual answer to this is to read into a Vec<u8> rather than a String. But all the code I've seen uses file.read_to_end(&mut buf), which loads the whole file into memory and won't work for the file sizes I have to deal with.

What I'm looking for is a way to read the file line by line, apply a lossy UTF-8 conversion, do some calculations, and write the output to another file.

Recommended answer

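You can use BufReader's read_until function (it comes from the BufRead trait). It is similar to read_to_end, but also takes a byte delimiter argument and stops after reading that byte, so only one line is held in memory at a time. The delimiter can be any byte; a newline byte (b'\n') is suitable here. Afterwards, you can lossily convert the buffer from UTF-8 with String::from_utf8_lossy, which replaces invalid sequences with U+FFFD instead of returning an error. It would look something like this: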

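use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    let file = File::open("foo.txt")?;
    let mut reader = BufReader::new(file);
    let mut buf = vec![];

    // read_until returns Ok(0) at end of file. Using `?` propagates I/O
    // errors instead of silently ending the loop, as `while let Ok(_)` would.
    while reader.read_until(b'\n', &mut buf)? != 0 {
        // Invalid UTF-8 sequences are replaced with U+FFFD.
        let line = String::from_utf8_lossy(&buf);
        // `line` still ends with its newline, so print! avoids doubling it.
        print!("{}", line);
        // read_until appends to the buffer, so clear it before the next line.
        buf.clear();
    }

    Ok(())
}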

Of course, this could be abstracted away into an iterator, just as Lines does, but the basic logic is the same as above.
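As a rough illustration of that iterator idea, here is a minimal sketch. The LossyLines type and its new constructor are hypothetical names made up for this example (nothing like them exists in std); only read_until and String::from_utf8_lossy are real standard-library calls.

```rust
use std::io::{self, BufRead};

/// Hypothetical iterator (not part of std): yields each line of a reader
/// as a lossily decoded String, keeping the line terminator.
pub struct LossyLines<R> {
    reader: R,
    buf: Vec<u8>,
}

impl<R: BufRead> LossyLines<R> {
    pub fn new(reader: R) -> Self {
        LossyLines { reader, buf: Vec::new() }
    }
}

impl<R: BufRead> Iterator for LossyLines<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Reuse one buffer so only the current line is held in memory.
        self.buf.clear();
        match self.reader.read_until(b'\n', &mut self.buf) {
            // Zero bytes read means end of input.
            Ok(0) => None,
            // Invalid UTF-8 sequences become U+FFFD instead of an error.
            Ok(_) => Some(Ok(String::from_utf8_lossy(&self.buf).into_owned())),
            // I/O errors are handed to the caller, as Lines does.
            Err(e) => Some(Err(e)),
        }
    }
}
```

It could then be used much like lines, for example `for line in LossyLines::new(BufReader::new(file)) { let line = line?; ... }`. Unlike the loop above, this sketch allocates a new String per line, which trades some speed for a simpler interface.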

NOTE: unlike the lines function, the resulting strings will include the newline character, and the carriage return (\r) if there is one. If the behaviour of the solution has to match the lines function, those characters need to be stripped.
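One possible way to do that stripping is on the raw bytes, before the lossy conversion. The sketch below assumes the buffer was filled by read_until(b'\n', ...); trim_line_ending and decode_line are hypothetical helper names introduced for this example, not standard-library functions.

```rust
use std::borrow::Cow;

/// Hypothetical helper: removes one trailing "\n" or "\r\n", like `lines`.
/// A lone trailing "\r" with no "\n" after it is left in place.
fn trim_line_ending(buf: &[u8]) -> &[u8] {
    match buf.strip_suffix(b"\n") {
        Some(rest) => rest.strip_suffix(b"\r").unwrap_or(rest),
        None => buf,
    }
}

/// Hypothetical helper: strips the line ending, then decodes lossily.
/// Borrows from `buf` when the bytes are already valid UTF-8.
fn decode_line(buf: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(trim_line_ending(buf))
}
```

Inside the read loop, `let line = decode_line(&buf);` would replace the direct call to String::from_utf8_lossy, and the line could then be printed with println! again.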
