Importing a CSV containing special characters with numpy genfromtxt
Question
I have a CSV containing special characters. Some cells are arithmetic operations (like "(10/2)"). I would like to import these cells as strings in numpy using np.genfromtxt. What I notice is that it actually imports them as UTF-8 bytes (if I understand correctly). For instance, every time I have a division symbol I get this code in the numpy array: \xc3\xb7
How can I import these arithmetic operations as readable strings?
Thanks!
Recommended answer
It looks like the file may contain the 'other' divide symbol, the one we learn in grade school:
In [185]: b'\xc3\xb7'
Out[185]: b'\xc3\xb7'
In [186]: _.decode()
Out[186]: '÷'
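As a minimal sketch in plain Python (no numpy needed, variable names are illustrative only), the following shows why one character appears as two bytes in the question: the division sign is a single code point that UTF-8 stores as two bytes, whereas Latin-1 stores it as one.

```python
# The division sign is one Unicode character (U+00F7).
divide = '\u00f7'

utf8_bytes = divide.encode('utf-8')      # two bytes in UTF-8
latin1_bytes = divide.encode('latin-1')  # one byte in Latin-1

print(utf8_bytes)
print(latin1_bytes)

# Decoding the UTF-8 bytes gives back the readable character.
print(utf8_bytes.decode('utf-8') == divide)
```

So the bytes are not corrupt; they only need to be decoded with the right encoding to become a readable string.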
Recent numpy versions handle encoding better. Earlier ones tried to work entirely in bytestring mode (for Py3) to be compatible with Py2. But now genfromtxt takes an encoding parameter (added in numpy 1.14).
In [68]: txt = '''(10/2), 1, 2
...: (10/2), 3,4'''
In [70]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
#!/usr/bin/python3
Out[70]:
array([(b'(10/2)', 1, 2), (b'(10/2)', 3, 4)],
dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])
In [71]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding=None)
Out[71]:
array([('(10/2)', 1, 2), ('(10/2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
Admittedly, this simulated load from a list of strings is not the same as loading from a file. I don't have earlier numpy versions installed (nor Py2), so I can't show what happened before. But my gut feeling is that "(10/2)" shouldn't have given problems before, at least not in an ASCII file: there aren't any special characters in that string.
With the other divide symbol:
In [192]: txt = '''(10÷2), 1, 2
...: (10÷2), 3,4'''
In [194]: np.genfromtxt(txt.splitlines(), dtype=None, delimiter=',', encoding='utf8')
Out[194]:
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
The same content loaded from a file:
In [200]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',')
/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
#!/usr/bin/python3
Out[200]:
array([(b'(10\xf72)', 1, 2), (b'(10\xf72)', 3, 4)],
dtype=[('f0', 'S6'), ('f1', '<i8'), ('f2', '<i8')])
In [199]: np.genfromtxt('stack49859957.txt', dtype=None, delimiter=',', encoding='utf8')
Out[199]:
array([('(10÷2)', 1, 2), ('(10÷2)', 3, 4)],
dtype=[('f0', '<U6'), ('f1', '<i8'), ('f2', '<i8')])
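For anyone who wants to reproduce the file case end to end, here is a self-contained sketch. It assumes numpy 1.14 or later (where genfromtxt accepts encoding); the temporary file and its sample rows are made up for illustration and are not the answer's stack49859957.txt.

```python
import os
import tempfile

import numpy as np

# Write a small UTF-8 CSV containing the division sign (hypothetical sample data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', encoding='utf-8',
                                 delete=False) as f:
    f.write('(10÷2),1,2\n(10÷2),3,4\n')
    path = f.name

try:
    # encoding='utf-8' makes genfromtxt decode the file and return str fields.
    data = np.genfromtxt(path, dtype=None, delimiter=',', encoding='utf-8')
finally:
    os.remove(path)

print(data['f0'])             # the first column as readable strings
print(data['f0'].dtype.kind)  # 'U' means a unicode string field
```

With dtype=None the result is a structured array, so the string column is reached by its auto-generated field name f0.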
In earlier versions, encoding could be handled in a converter. I've helped with that task in previous SO questions.