How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?

Question

Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:

data = numpy.genfromtxt('data.txt', delimiter = ',')

This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?

Recommended answer

By default, np.genfromtxt uses dtype=float. That is why your string columns are converted to NaNs: after all, they're Not A Number...
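As a quick illustration of that default, here is a minimal sketch that reuses the two-row sample from this answer (the sample text stands in for the asker's data.txt, which is not shown in the question):

```python
import numpy as np
from io import StringIO

# Two rows: a string label followed by two numbers.
test = "a,1,2\nb,3,4"

# No dtype given, so every field is converted to float.
data = np.genfromtxt(StringIO(test), delimiter=",")

print(data.dtype)             # the whole array is floating point
print(np.isnan(data[:, 0]))   # the label column could not be converted
```

The numeric columns come through intact; only the label column is lost.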

You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:

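>>> import numpy as np
>>> from io import StringIO  # Python 3; Python 2 used "from StringIO import StringIO"
>>> test = "a,1,2\nb,3,4"
>>> # encoding="utf-8" makes the label column a str ('<U1') rather than bytes ('|S1')
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding="utf-8")
>>> a  # the integer width ('<i8') may differ by platform
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '<U1'), ('f1', '<i8'), ('f2', '<i8')])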

You can access the columns by using their name, like a['f0']...
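A short sketch of column access by name follows. The field names "label", "x" and "y" are made up for illustration; without the names argument the fields would be called f0, f1 and f2 as above.

```python
import numpy as np
from io import StringIO

test = "a,1,2\nb,3,4"

# names= replaces the default field names f0, f1, f2 (these names are hypothetical).
a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None,
                  encoding="utf-8", names=["label", "x", "y"])

labels = a["label"]                          # 1-D array of strings
values = np.column_stack([a["x"], a["y"]])   # plain 2-D numeric array

print(labels)
print(values)
```

Each field is an ordinary 1-D array, so the numeric ones can be stacked back into a regular 2-D array when that is more convenient.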

Using dtype=None is a good trick if you don't know what your columns should be. If you already know what type they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use

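>>> # "U10" is a unicode string of up to 10 characters; on Python 3, "|S10" would give bytes
>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("U10", int, float), encoding="utf-8")
array([('a', 1, 2.), ('b', 3, 4.)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])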

Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or explicit, non-homogeneous dtype), you end up with a structured array.
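If a regular 2-D float array is wanted for the numeric part, one possible approach (not part of the original answer) is to split the structured array into its label field and an unstructured block. The sketch below assumes NumPy 1.16 or later, where numpy.lib.recfunctions.structured_to_unstructured is available:

```python
import numpy as np
from io import StringIO
from numpy.lib import recfunctions as rfn

test = "a,1,2\nb,3,4"

# Explicit dtype: one string field and two float fields.
a = np.genfromtxt(StringIO(test), delimiter=",",
                  dtype=("U10", float, float), encoding="utf-8")

labels = a["f0"]
# Select the numeric fields, then convert them to an ordinary 2-D array.
numbers = rfn.structured_to_unstructured(a[["f1", "f2"]])

print(labels)
print(numbers.shape)
```

Declaring both numeric fields as float keeps the unstructured result homogeneous, so no silent type promotion happens during the conversion.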

[Note: With dtype=None, the input is parsed a second time and the type of each column is updated to the most general type its values require: first we try a bool, then an int, then a float, then a complex, and we keep a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regexps), but nothing has stuck so far.]
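The guessing order can be observed on a small made-up input in which each column needs a different type. This is only a sketch of the visible behaviour, not of the internal implementation:

```python
import numpy as np
from io import StringIO

# Column 0: integers only; column 1: an int and a float; column 2: a non-numeric value.
test = "1,1,x\n2,2.5,3"

a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding="utf-8")

# Inspect the kind chosen for each field: integer, floating point, or string.
for name in a.dtype.names:
    print(name, a.dtype[name].kind)
```

A single value that does not fit the narrower type is enough to widen the whole column.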
