How to use numpy.genfromtxt when the first column is a string and the remaining columns are numbers?

Problem description

Basically, I have a bunch of data where the first column is a string (label) and the remaining columns are numeric values. I run the following:

data = numpy.genfromtxt('data.txt', delimiter = ',')

This reads most of the data well, but the label column just gets 'nan'. How can I deal with this?

Solution

By default, np.genfromtxt uses dtype=float: that's why your string columns are converted to NaNs. After all, they're Not A Number...
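To see the default behaviour in isolation, here is a minimal sketch. The two-row `sample` string is an assumption standing in for the asker's data.txt, and the variable names are only illustrative.

```python
import io

import numpy as np

# Hypothetical stand-in for the contents of data.txt.
sample = "a,1,2\nb,3,4"

# With the default dtype=float, every field is parsed as a float;
# labels that cannot be parsed come back as NaN.
data = np.genfromtxt(io.StringIO(sample), delimiter=",")

# The whole label column is NaN, while the numeric columns are intact.
labels_are_nan = bool(np.isnan(data[:, 0]).all())
numeric_part = data[:, 1:]
```

The result is an ordinary 2-D float array, so the labels are lost rather than merely hidden.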

You can ask np.genfromtxt to try to guess the actual type of your columns by using dtype=None:

>>> import numpy as np
>>> from io import StringIO
>>> test = "a,1,2\nb,3,4"
>>> a = np.genfromtxt(StringIO(test), delimiter=",", dtype=None, encoding="utf-8")
>>> a
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('f0', '<U1'), ('f1', '<i8'), ('f2', '<i8')])

You can access the columns by using their name, like a['f0']...
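As a rough illustration of field access, the sketch below reads the same kind of two-row sample and pulls out columns by name. The `sample` data and the field names passed to `names` ("label", "x", "y") are assumptions chosen for the example.

```python
import io

import numpy as np

# Hypothetical sample data: one label column, two numeric columns.
sample = "a,1,2\nb,3,4"

a = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# Without explicit names, the fields are called f0, f1, f2, ...
field_names = a.dtype.names
labels = a["f0"]        # the string column
first_values = a["f1"]  # the first numeric column
first_row = a[0]        # a single record; its fields are also accessible by name

# The names parameter replaces the auto-generated field names.
b = np.genfromtxt(
    io.StringIO(sample),
    delimiter=",",
    dtype=None,
    encoding="utf-8",
    names=["label", "x", "y"],
)
named_labels = b["label"]
```

Giving the fields meaningful names tends to make later code easier to read than relying on f0, f1, and so on.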

Using dtype=None is a good trick if you don't know what your columns should be. If you already know what type they should have, you can give an explicit dtype. For example, in our test, we know that the first column is a string, the second an int, and we want the third to be a float. We would then use

>>> np.genfromtxt(StringIO(test), delimiter=",", dtype=("U10", int, float), encoding="utf-8")
array([('a', 1, 2.), ('b', 3, 4.)],
      dtype=[('f0', '<U10'), ('f1', '<i8'), ('f2', '<f8')])

Using an explicit dtype is much more efficient than using dtype=None and is the recommended way.

In both cases (dtype=None or explicit, non-homogeneous dtype), you end up with a structured array.
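If what you want in the end is the labels plus an ordinary 2-D numeric array, one possible approach (a sketch, not the only way) is to split the structured array after reading it. The `sample` data and variable names are assumptions for illustration.

```python
import io

import numpy as np

# Hypothetical sample data: one label column, two numeric columns.
sample = "a,1,2\nb,3,4"

a = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# Keep the label field as its own 1-D array.
labels = a["f0"]

# Stack the remaining fields side by side into a plain 2-D float array.
numeric_names = a.dtype.names[1:]
values = np.column_stack([a[name] for name in numeric_names]).astype(float)

# Alternative: skip the label column at read time with usecols.
values_only = np.genfromtxt(io.StringIO(sample), delimiter=",", usecols=(1, 2))
```

The usecols variant avoids the structured array altogether, at the cost of a second read if the labels are also needed.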

[Note: With dtype=None, the input is parsed a second time and the type of each column is upgraded to the most general type its values require: first we try a bool, then an int, then a float, then a complex, and we fall back to a string if all else fails. The implementation is rather clunky, actually. There have been some attempts to make the type guessing more efficient (using regular expressions), but nothing has stuck so far.]
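The upgrade order described in the note can be observed with a small experiment. This is only a sketch: the four-column `sample` is made up so that each column ends up with a different guessed type.

```python
import io

import numpy as np

# Hypothetical data: a label, an all-int column, a column mixing
# ints and floats, and a column mixing a number with text.
sample = "a,1,1,1\nb,2,2.5,x"

a = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=None, encoding="utf-8")

# One dtype kind code per column: 'U' = string, 'i' = integer, 'f' = float.
kinds = [a.dtype[name].kind for name in a.dtype.names]

mixed_numeric = a["f2"]  # int and float values are both stored as floats
mixed_text = a["f3"]     # a single non-numeric value keeps the column as strings
```

A single stray non-numeric value is enough to turn a whole column into strings, which is one practical reason to prefer an explicit dtype when the layout is known.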
