How do I load a UTF16-encoded text file in Julia?

Question

I have a text file that I am pretty sure is encoded in UTF-16, but I don't know how to load it in Julia. Do I have to load it as bytes and then convert it with UTF16String?

Recommended answer

The simplest way is to read the file as bytes and then convert:

s = open(filename, "r") do f
    # readbytes and utf16 are the Julia 0.3/0.4-era functions used by this answer
    utf16(readbytes(f))
end

Note that utf16 also checks for a byte-order mark (BOM), so it handles endianness for you and does not include the BOM in the resulting s.
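On Julia 1.x, readbytes and utf16 are no longer part of Base, and the UTF16String type is, as far as I know, only available from the LegacyStrings.jl package. The following is a minimal sketch of the same read-then-convert idea using only Base. The function name read_utf16 is hypothetical, the result is an ordinary UTF-8 String rather than a UTF16String, and the sketch assumes that data without a BOM is native-endian.

```julia
# Hypothetical helper (not part of Base): decode a UTF-16 file into a String.
function read_utf16(path::AbstractString)
    bytes = read(path)  # whole file as Vector{UInt8}
    isodd(length(bytes)) &&
        throw(ArgumentError("file length is not a multiple of 2 bytes"))

    # View the bytes as native-endian 16-bit code units, then copy into a Vector.
    units = collect(reinterpret(UInt16, bytes))

    if !isempty(units)
        # A byte-swapped BOM (0xFFFE) means the file uses the opposite endianness.
        if units[1] == 0xfffe
            units .= bswap.(units)
        end
        # Drop the BOM so it does not end up as the first character.
        if units[1] == 0xfeff
            popfirst!(units)
        end
    end

    # transcode converts UTF-16 code units into a UTF-8 encoded String.
    return transcode(String, units)
end
```

Usage would look like s = read_utf16(filename). Since transcode always produces new UTF-8 data, this variant copies the data, like the utf16 call above.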

If you really want to avoid making a copy of the data, and you know it is native-endian, that is possible too. However, you have to write a NUL terminator explicitly, because Julia's UTF-16 string data internally ends with a NUL code unit so that it can be passed to C routines that expect NUL-terminated data:

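s = open(filename, "r") do f
    b = readbytes(f)
    # make room for one extra 16-bit code unit (two bytes)
    resize!(b, length(b)+2)
    # set both new bytes to zero: this is the NUL terminator
    b[end] = b[end-1] = 0
    # reinterpret the bytes as UInt16 code units and wrap them without copying
    UTF16String(reinterpret(UInt16, b))
end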

However, a typical UTF-16 text file starts with a BOM, and in that case the string s will include the BOM as its first character, which may not be what you want.
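If a BOM does end up at the front of a decoded string, it can be dropped afterwards. The following is a small sketch for current Julia versions. The name strip_bom is hypothetical, and the helper only checks for a leading U+FEFF character; it does not detect or correct byte order.

```julia
# Hypothetical helper: remove a leading byte-order mark (U+FEFF) from a string.
function strip_bom(s::AbstractString)
    if !isempty(s) && first(s) == '\ufeff'
        # Skip past the first character, whatever its encoded width.
        return s[nextind(s, firstindex(s)):end]
    end
    return s
end
```

Strings without a leading BOM are returned unchanged.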
