Reading UTF-8 text files with ReadList


Question

Is it possible to read UTF-8 (or any other) encoded text files with ReadList[..., Word], or is ReadList ASCII-only? If it is ASCII-only, is it possible to "fix" the encoding of the already read data with good performance (i.e. preserving the performance advantage of ReadList over Import)?

Import[..., CharacterEncoding -> "UTF8"] works, but it is quite a bit slower than ReadList. $CharacterEncoding has no effect on ReadList.

Download a sample UTF-8 encoded file here.

For testing performance on a large input, see the test file in this question.

Here are the timings of the answers on a large-ish text file:

Import

In[2]:= Timing[
 data = Import[file, "Text"];
 ]

Out[2]= {5.234, Null}

Heike

In[4]:= Timing[
 data = ReadList[file, String];
 FromCharacterCode[ToCharacterCode[data], "UTF8"];
 ]

Out[4]= {4.328, Null}

Mr.Wizard

In[5]:= Timing[
 string = FromCharacterCode[BinaryReadList[file], "UTF-8"];
 ]

Out[5]= {2.281, Null}
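
The ReadList variant above repairs the encoding after the fact by round-tripping through character codes. The following is a minimal sketch of why that can work, assuming ReadList hands back each byte of the file as one character (the premise that snippet relies on). It uses no file at all: the string `misread` is a hypothetical stand-in for what such a read would return.

```mathematica
(* UTF-8 bytes of a single non-ASCII character, e-acute. *)
bytes = ToCharacterCode["\[EAcute]", "UTF-8"]   (* {195, 169} *)

(* Interpreting each byte as its own character gives the familiar mojibake. *)
misread = FromCharacterCode[bytes]              (* "Ã©" *)

(* Converting back to codes recovers the bytes; decoding them as UTF-8 *)
(* restores the original character. *)
fixed = FromCharacterCode[ToCharacterCode[misread], "UTF-8"]

fixed === "\[EAcute]"                           (* True *)
```

Both ToCharacterCode and FromCharacterCode also accept lists, so the same round trip can be applied to a whole list of strings in one call, as in the timing above.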

Recommended answer

If I leave out Word, this works:

$CharacterEncoding = "UTF-8";

ReadList["UTF8.txt"]

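However, this falls short, because the data is not read as strings. Reading the raw bytes and decoding them yields a string instead: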

FromCharacterCode[BinaryReadList["UTF8.txt"], "UTF-8"]
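
Since the question asks for word-level input, the decoded string can then be split into words. This is only a sketch: the variable names are illustrative, and StringSplit with no pattern splits on any run of whitespace, which approximates ReadList[..., Word] but may not match it exactly for files with unusual separators.

```mathematica
(* File name taken from the answer above; substitute your own path. *)
file = "UTF8.txt";

(* Read the raw bytes and decode them as UTF-8 in a single call. *)
text = FromCharacterCode[BinaryReadList[file], "UTF-8"];

(* Split on runs of whitespace to get a list of words. *)
words = StringSplit[text];

(* Inspect the first few words. *)
Take[words, UpTo[10]]
```

Decoding the whole file once and splitting afterwards keeps the work in a few vectorized calls, which is consistent with the timings reported in the question.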
