Importing a CSV with embedded special characters using numpy genfromtxt


Question

I have a CSV containing special characters. Some cells are arithmetic operations (like "(10/2)"). I would like to import these cells as strings in numpy using np.genfromtxt. What I notice is that it actually imports them as UTF-8 bytes (if I understand correctly). For instance, every time I have a division symbol I get this code in the numpy array: \xc3\xb7

How can I import these arithmetic operations as readable strings?

Thanks!

Recommended answer

It looks like the file may contain the 'other' divide symbol (÷), the one we learn in grade school:

In [185]: b'\xc3\xb7'
Out[185]: b'\xc3\xb7'
In [186]: _.decode()
Out[186]: '÷'
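
To make the byte pattern from the question concrete, here is a minimal sketch in plain Python (no numpy needed) that compares the division sign with the ASCII slash. The variable names are only for illustration.

```python
# U+00F7 DIVISION SIGN versus the ASCII slash.
division_sign = "\u00f7"
slash = "/"

utf8_bytes = division_sign.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = division_sign.encode("latin-1")  # one byte in Latin-1
slash_bytes = slash.encode("utf-8")             # plain ASCII, one byte

print(utf8_bytes)    # b'\xc3\xb7'
print(latin1_bytes)  # b'\xf7'
print(slash_bytes)   # b'/'
print(utf8_bytes.decode("utf-8") == division_sign)  # True
```

So a cell that shows up as \xc3\xb7 was written with ÷ rather than /, and was read back as raw UTF-8 bytes without being decoded.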


Recent numpy versions handle encoding better. Earlier ones tried to work entirely in bytestring mode (on Py3) to be compatible with Py2, but genfromtxt now takes an encoding parameter.

In [68]: txt = '''(10/2), 1, 2
    ...: (10/2), 3,4'''

In [70]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
  #!/usr/bin/python3
Out[70]: 
array([(b'(10/2)', 1, 2), (b'(10/2)', 3, 4)],
      dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])

In [71]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None)
Out[71]: 
array([('(10/2)', 1, 2), ('(10/2)', 3, 4)],
      dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])

Admittedly, this simulated load from a list of strings is not the same as loading from a file. I don't have earlier numpy versions installed (and nothing on Py2), so I can't show what happened before. But my gut feeling is that "(10/2)" shouldn't have caused problems before, at least not in an ASCII file. There aren't any special characters in that string.

With the other divide symbol:

In [192]: txt = '''(10÷2), 1, 2
     ...: (10÷2), 3,4'''
In [194]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding='utf8')
Out[194]: 
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
      dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])

The same content in a file:

In [200]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
  #!/usr/bin/python3
Out[200]: 
array([(b'(10\xf72)', 1, 2), (b'(10\xf72)', 3, 4)],
      dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])

In [199]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',', encoding='utf8')
Out[199]: 
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
      dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
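
The file-based session above can be reproduced as a standalone script. This is a minimal sketch, assuming a numpy version whose genfromtxt accepts the encoding argument. The file name division_demo.csv and the temporary directory are made up for the example and are not the file used above.

```python
import os
import tempfile

import numpy as np

# Sample rows using U+00F7 (the division sign), not the ASCII slash.
rows = "(10\u00f72), 1, 2\n(10\u00f72), 3,4\n"

with tempfile.TemporaryDirectory() as tmpdir:
    # "division_demo.csv" is a made-up name for this sketch.
    path = os.path.join(tmpdir, "division_demo.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(rows)

    # Naming the encoding makes genfromtxt return str fields, not bytes.
    data = np.genfromtxt(path, dtype=None, delimiter=",", encoding="utf-8")

print(data["f0"])
print(data["f1"], data["f2"])
```

The first column comes back as a unicode ('U') field, so the ÷ character stays readable. The two numeric columns are still detected as integers because dtype=None lets genfromtxt infer each column's type.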

In earlier versions, the decoding could be implemented in a converter. I've helped with that task in previous SO questions.
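
As a related sketch (this is not the converter approach, but a different technique: decoding after loading), a bytes array holding UTF-8 data can be turned into readable strings with np.char.decode. The sample array here is typed by hand as an assumption about what a bytes-only load might contain. It is not output from an older numpy.

```python
import numpy as np

# Byte strings as a bytes-only load of a UTF-8 file might hold them
# (hypothetical sample data, typed by hand rather than read from a file).
raw = np.array([b"(10\xc3\xb72)", b"(10/2)"])

# Decode every element from UTF-8 bytes to a unicode string.
decoded = np.char.decode(raw, "utf-8")
print(decoded)
```

The codec passed to np.char.decode has to match how the bytes were actually produced. If they came from a different encoding, 'utf-8' would need to be swapped for that one.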
