How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?
Question
Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:
data = numpy.genfromtxt('data.txt', delimiter = ',')
This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?
Recommended answer
By default, np.genfromtxt uses dtype=float: that's why your string columns are converted to NaNs. After all, they're Not A Number.
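To make the default behaviour concrete, here is a minimal sketch using the same two-line sample as the rest of this answer (the variable names are only illustrative). It shows that with the default dtype the label column is read as NaN while the numeric columns survive:

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# No dtype given: genfromtxt falls back to float for every column.
data = np.genfromtxt(StringIO(text), delimiter=",")

# The labels "a" and "b" cannot be parsed as floats, so they become NaN.
labels_are_nan = bool(np.isnan(data[:, 0]).all())

# The remaining columns are ordinary floats.
numbers = data[:, 1:]
```

If the labels are not needed at all, slicing them off like this may be enough; otherwise a non-float dtype is required.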
You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:
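>>> from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
>>> import numpy as np
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None)
>>> a  # repr from a recent NumPy on a 64-bit platform; older versions report the string field as '|S1'
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '<U1'), ('f1', '<i8'), ('f2', '<i8')])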
You can access the columns by using their names, like a['f0'].
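As a rough illustration of field access, the sketch below reads the same sample and pulls out individual columns. Two things here are assumptions rather than part of the original answer: encoding="utf-8" is passed so that labels come back as str instead of bytes on NumPy versions where that matters, and the field names "label", "x", "y" are made up for the example.

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# dtype=None lets genfromtxt guess one type per column.
# encoding="utf-8" is an assumption added here so labels are str, not bytes.
a = np.genfromtxt(StringIO(text), delimiter=",", dtype=None, encoding="utf-8")

field_names = a.dtype.names  # default names: f0, f1, f2
labels = a["f0"]             # the string column
first_numbers = a["f1"]      # the first numeric column

# The names= argument replaces the default field names
# ("label", "x", "y" are hypothetical names chosen for this example).
b = np.genfromtxt(
    StringIO(text),
    delimiter=",",
    dtype=None,
    encoding="utf-8",
    names=["label", "x", "y"],
)
x_values = b["x"]
```

Descriptive field names tend to make later code easier to read than f0, f1, and so on.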
Using dtype=None is a good trick if you don't know what your columns should be. If you already know what types they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use:
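>>> # "U10" is a unicode string of up to 10 characters (Python 2 code used the byte-string code "|S10")
>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("U10", int, float))
array([('a', 1, 2.), ('b', 3, 4.)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])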
Using an explicit dtype is much more efficient than using dtype=None, because no type guessing is needed, and it is the recommended way.
In both cases (dtype=None or an explicit, non-homogeneous dtype), you end up with a structured array: a one-dimensional array of records with named fields, not a two-dimensional array.
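Since a structured array is one-dimensional, code that expects a plain 2-D numeric matrix needs one more step. The following is a minimal sketch of one way to split the result into a label vector and a float matrix. The field names and the use of np.column_stack are choices made for this example, not something prescribed by the original answer, and encoding="utf-8" is again an assumption to get str labels.

```python
from io import StringIO

import numpy as np

text = "a,1,2\nb,3,4"

# Explicit structured dtype with hypothetical field names.
dtype = [("label", "U10"), ("x", int), ("y", float)]
records = np.genfromtxt(StringIO(text), delimiter=",", dtype=dtype, encoding="utf-8")

# One record per input row, so the array is 1-D.
n_dims = records.ndim

# Pull the labels out as their own array.
labels = records["label"]

# Stack the numeric fields side by side into an ordinary 2-D float array.
numeric_fields = ("x", "y")
values = np.column_stack([records[name] for name in numeric_fields]).astype(float)
```

After this, labels[i] names row i of values, which is often the layout wanted for plotting or for passing the numbers to other libraries.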
[Note: With dtype=None, the input is parsed a second time and the type of each column is updated to the narrowest type that fits all of its values: first we try a bool, then an int, then a float, then a complex, and we keep a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regexps), but nothing has stuck so far.]
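The effect of that type ladder can be observed with a small experiment. In this sketch (the sample data is invented for illustration, and encoding="utf-8" is an assumption so that strings are read as str), the first column contains only integers, the second contains one non-integer number, and the third contains one non-numeric value:

```python
from io import StringIO

import numpy as np

# Column 0: all ints. Column 1: one value needs float. Column 2: one value is text.
text = "1,1,1\n2,2.5,x\n3,4,7"

guessed = np.genfromtxt(StringIO(text), delimiter=",", dtype=None, encoding="utf-8")

# dtype "kind" codes: "i" = signed integer, "f" = float, "U" = unicode string.
kinds = [guessed.dtype[name].kind for name in guessed.dtype.names]

float_column = guessed["f1"]   # the ints 1 and 4 were promoted to floats
string_column = guessed["f2"]  # the numbers 1 and 7 were kept as text
```

A single stray value is enough to push a whole column up the ladder, which is one more reason to prefer an explicit dtype when the column types are known.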