How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:
data = numpy.genfromtxt('data.txt', delimiter = ',')
This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?
By default, np.genfromtxt uses dtype=float: that's why your string columns are converted to NaNs because, after all, they're Not A Number...
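The default behaviour can be reproduced with a minimal sketch like the one below. It assumes a small in-memory sample standing in for data.txt; the sample contents and the variable names are made up for illustration.

```python
from io import StringIO

import numpy as np

# Hypothetical stand-in for data.txt: a label column followed by numbers.
sample = "a,1,2\nb,3,4"

# With the default dtype=float every field is parsed as a float,
# so the label column cannot be converted and is stored as nan.
data = np.genfromtxt(StringIO(sample), delimiter=",")

print(data.shape)            # two rows, three columns
print(np.isnan(data[:, 0]))  # the label column is nan in every row
print(data[:, 1:])           # the numeric columns are read normally
```

The result is an ordinary 2-D float array, which is why the labels have nowhere to go.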
You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:
>>> import numpy as np
>>> from io import StringIO  # Python 3 location of StringIO
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding="utf-8")
>>> a
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '<U1'), ('f1', '<i8'), ('f2', '<i8')])
You can access the columns by using their names, like a['f0']...
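As a small illustration of field access, the sketch below reads the same kind of sample and pulls out individual columns. It assumes a NumPy version that supports the encoding argument of genfromtxt (passed here so that labels come back as text rather than bytes); the variable names are illustrative only.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: one label column and two numeric columns.
sample = "a,1,2\nb,3,4"

a = np.genfromtxt(StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# Without explicit names, the fields are called f0, f1, f2, ...
print(a.dtype.names)

labels = a["f0"]        # 1-D array holding the label column
first_values = a["f1"]  # 1-D array holding the first numeric column

# Indexing by position returns one row as a record;
# a field of that record can then be selected by name.
first_row = a[0]
print(first_row["f2"])
```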
Using dtype=None is a good trick if you don't know what your columns should be. If you already know what type they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use:
>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("U10", int, float), encoding="utf-8")
array([('a', 1, 2.), ('b', 3, 4.)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])
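An explicit dtype can also carry field names, which tends to read better than f0, f1, f2. The following is a sketch under the assumption that the columns are a label, an integer and a float; the field names label, count and value are invented for this example.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: one label column and two numeric columns.
sample = "a,1,2\nb,3,4"

# A list of (name, type) pairs fixes both the field names and their types.
row_dtype = [("label", "U10"), ("count", int), ("value", float)]

a = np.genfromtxt(StringIO(sample), delimiter=",", dtype=row_dtype, encoding="utf-8")

print(a["label"])  # the string column
print(a["count"])  # parsed as integers
print(a["value"])  # parsed as floats, even though the text has no decimal point
```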
Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.
In both cases (dtype=None or explicit, non-homogeneous dtype), you end up with a structured array.
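If a plain 2-D numeric array is needed alongside the labels, one possible approach (not part of the original answer) is to split the structured array by field. The sketch below assumes the first field is the label and all remaining fields are numeric.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: one label column and two numeric columns.
sample = "a,1,2\nb,3,4"

a = np.genfromtxt(StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# The first field holds the labels.
labels = a[a.dtype.names[0]]

# Stack the remaining fields side by side into an ordinary 2-D array.
values = np.column_stack([a[name] for name in a.dtype.names[1:]])

print(labels)
print(values.shape)
```

Here values behaves like any regular array (slicing, arithmetic, reductions), while labels keeps the row identifiers in the same order.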
[Note: With dtype=None, the input is parsed a second time and the type of each column is updated to match the largest type needed: first we try a bool, then an int, then a float, then a complex, and we keep a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regular expressions), but nothing that has stuck so far.]
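The effect of this type guessing can be observed with a small sketch. The sample below is made up so that one column mixes integers with a non-numeric token and another mixes integers with a decimal value.

```python
from io import StringIO

import numpy as np

# Hypothetical sample:
#   column 1 contains "x", which is not a number
#   column 2 contains "4.5", which is not an integer
sample = "a,1,2\nb,3,4.5\nc,x,6"

a = np.genfromtxt(StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# Inspect the type that was guessed for each column.
for name in a.dtype.names:
    print(name, a.dtype[name])

print(a["f1"])  # kept as strings because of the "x"
print(a["f2"])  # upgraded to floats because of the "4.5"
```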