Specifying encoding using NumPy loadtxt/savetxt

Using the NumPy loadtxt and savetxt functions fails whenever non-ASCII characters are involved. These functions are primarily meant for numeric data, but alphanumeric headers/footers are also supported.

Both loadtxt and savetxt seem to apply the latin-1 encoding, which I find at odds with the rest of Python 3; Python 3 is thoroughly Unicode-aware and generally uses UTF-8 as its default encoding.

Given that NumPy hasn't moved to UTF-8 as the default encoding, can I at least change the encoding away from latin-1, either via some implemented function/attribute or a known hack, whether just for loadtxt/savetxt or for NumPy in its entirety?

That this is not possible with Python 2 is forgivable, but it really should not be a problem when using Python 3. I've run into the problem with every combination of Python 3.x and recent versions of NumPy that I've tried.

Example code

Consider the file data.txt with the content

# This is π
3.14159265359

Trying to load this with

import numpy as np
pi = np.loadtxt('data.txt')
print(pi)

fails with a UnicodeEncodeError exception, stating that the latin-1 codec can't encode the character '\u03c0' (the π character).

This is frustrating because π is only present in a comment/header line, so there is no reason for loadtxt to even attempt to encode this character.

I can successfully read in the file by explicitly skipping the first row, using pi = np.loadtxt('data.txt', skiprows=1), but it is inconvenient to have to know the exact number of header lines.
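
One way to avoid hard-coding the number of header lines is to strip the comment lines yourself and hand the remaining lines to loadtxt, which also accepts a list of strings. This is only a sketch of a workaround (not part of the original question), assuming the file is UTF-8 encoded and comments start with '#':

import numpy as np

# Read the file with an explicit encoding and drop comment lines ourselves,
# so loadtxt only ever sees plain ASCII numeric data.
with open('data.txt', encoding='utf-8') as f:
    data_lines = [line for line in f if not line.lstrip().startswith('#')]

pi = np.loadtxt(data_lines)
print(pi)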

The same exception is thrown if I try to write a Unicode character using savetxt:

np.savetxt('data.txt', [3.14159265359], header='# This is π')

To accomplish this task successfully, I first have to write the header by some other means, and then save the data to a file object opened with the 'a+b' mode, e.g.

with open('data.txt', 'w') as f:
    f.write('# This is π\n')
with open('data.txt', 'a+b') as f:
    np.savetxt(f, [3.14159265359])

which needless to say is both ugly and inconvenient.
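
A slightly more compact variant of the same idea is to do all the writing through a single binary file handle, since savetxt accepts an already-open file object. Again, this is just a sketch of a workaround:

import numpy as np

with open('data.txt', 'wb') as f:
    # Encode the header ourselves so NumPy never has to touch the non-ASCII text.
    f.write('# This is π\n'.encode('utf-8'))
    np.savetxt(f, [3.14159265359])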

Solution

I settled on the solution by hpaulj, which I thought would be nice to spell out fully. Near the top of my program I now do

import numpy as np

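# Monkey-patch the str/bytes helpers that NumPy uses internally so that
# they encode and decode with UTF-8 instead of latin-1.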
asbytes = lambda s: s if isinstance(s, bytes) else str(s).encode('utf-8')
asstr = lambda s: s.decode('utf-8') if isinstance(s, bytes) else str(s)
np.compat.py3k.asbytes = asbytes
np.compat.py3k.asstr = asstr
np.compat.py3k.asunicode = asstr
np.lib.npyio.asbytes = asbytes
np.lib.npyio.asstr = asstr
np.lib.npyio.asunicode = asstr

after which np.loadtxt and np.savetxt handle Unicode correctly.

Note that for newer versions of NumPy (I can confirm 1.14.3, but probably somewhat older versions as well) this trick is not needed, as Unicode now seems to be handled correctly by default.
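
For reference, NumPy 1.14 and later also expose an explicit encoding parameter on both loadtxt and savetxt, so on those versions the encoding can be requested directly instead of relying on the default. A minimal sketch of that usage:

import numpy as np

# savetxt prepends the comment character ('# ') to the header itself,
# so the header string should not include it.
np.savetxt('data.txt', [3.14159265359], header='This is π', encoding='utf-8')

pi = np.loadtxt('data.txt', encoding='utf-8')
print(pi)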

Solution

At least for savetxt, the encoding is handled in

Signature: np.lib.npyio.asbytes(s)
Source:   
    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return str(s).encode('latin1')
File:      /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type:      function

Signature: np.lib.npyio.asstr(s)
Source:   
    def asstr(s):
        if isinstance(s, bytes):
            return s.decode('latin1')
        return str(s)
File:      /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type:      function

The header is written to the file (opened in 'wb' mode) with

        header = header.replace('\n', '\n' + comments)
        fh.write(asbytes(comments + header + newline))
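
This is where the exception from the question originates: latin-1 simply cannot represent π, as a quick check outside NumPy illustrates:

# latin-1 has no code point for the Greek letter π, so this raises
# UnicodeEncodeError, the same error that surfaces from savetxt's header handling.
'# This is π'.encode('latin1')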

Write numpy unicode array to a text file contains some of my previous explorations; there I was focusing on characters in the data, not the header.
