Programmatically add column names to numpy ndarray

Question

I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.

Here is my code:

data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)

#Add headers
csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))

Dimension-based diagnostics match what I expect:

print len(csv_names)
>> 108
print data.shape
>> (1652, 108)

"print data.dtype.names" also returns the expected output.

But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...

print data["EDUC"].shape
>> (1652, 108)

... and it appears to contain more missing values than there are rows in the data set.

print np.sum(np.isnan(data["EDUC"]))
>> 27976

Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!

Recommended answer

The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.

Here is what you must know about NumPy:

  1. NumPy arrays only contain elements of a single type.
  2. If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).
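To make these two points concrete, here is a minimal sketch of a structured array. The field names and values are made up for illustration and do not come from the asker's data set.

```python
import numpy as np

# Hypothetical two-field record type; the field names are illustrative only.
person = np.dtype([("age", "float64"), ("educ", "float64")])

# A 1-D array of three records: each element is one (age, educ) structure.
table = np.array([(31.0, 12.0), (45.0, 16.0), (27.0, 14.0)], dtype=person)

print(table.shape)        # (3,)  one record per row, not (3, 2)
print(table.dtype.names)  # ('age', 'educ')
print(table["educ"])      # [12. 16. 14.]
```

Selecting a field by name returns an ordinary 1-D float array, which is the "column" behavior the question is after.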

In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
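The following small sketch reproduces the symptom from the question on a tiny array and contrasts it with the view-based conversion. The names `a`, `b`, `c` and the 2x3 array are placeholders, and the behavior shown for astype() is what I would expect from current NumPy releases rather than something the original answer states.

```python
import numpy as np

names = ["a", "b", "c"]  # placeholder column names
record = np.dtype([(n, "float64") for n in names])
plain = np.arange(6, dtype="float64").reshape(2, 3)

# astype() converts every scalar into a whole record (all fields get that
# value), so the 2-D shape is kept and each "column" is still 2-D.
broadcast = plain.astype(record)
print(broadcast.shape)       # (2, 3)
print(broadcast["b"].shape)  # (2, 3)

# view() reinterprets each row of three floats as one record, giving
# shape (2, 1); the reshape then drops the leftover axis.
structured = plain.view(record).reshape(len(plain))
print(structured.shape)      # (2,)
print(structured["b"])       # [1. 4.]
```

This would also explain why the NaN count in the question exceeds the number of rows: `data["EDUC"]` there still has 1652 x 108 entries.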

These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore have the same size: they can be accessed, at a low-level, very simply and quickly.

Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).
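As a rough sketch of that approach, the snippet below reads a small in-memory CSV with `names=True`, so the header row supplies the field names. The CSV text is invented for the example, and its header is unquoted; the asker's quoted header fields may need extra cleanup that is not shown here.

```python
import io

import numpy as np

# Hypothetical CSV content standing in for the asker's file.
csv_text = "AGE,EDUC\n31,12\n45,16\n"

# names=True takes the field names from the first line of the input.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.shape)        # (2,)
print(data.dtype.names)  # ('AGE', 'EDUC')
print(data["EDUC"])      # [12. 16.]
```

The `names` argument can also be given as an explicit list of strings, in which case the header line has to be skipped with `skip_header=1`.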

If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:

data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

(you were close: you used astype() instead of view()).
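For completeness, here is a self-contained sketch of the view-based conversion followed by column selection. The column names and values are hypothetical. The call to `numpy.lib.recfunctions.unstructured_to_structured` is an alternative that the original answer does not mention; it is available in NumPy 1.16 and later.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

csv_names = ["AGE", "EDUC", "INCOME"]  # hypothetical header names
data = np.array([[31.0, 12.0, 40.0],
                 [45.0, np.nan, 55.0]])

record = np.dtype([(n, "float64") for n in csv_names])

# view() needs contiguous rows; ascontiguousarray is a no-op if they already are.
by_view = np.ascontiguousarray(data).view(record).reshape(len(data))

# Helper-based alternative (an addition, not part of the original answer).
by_helper = rfn.unstructured_to_structured(data, dtype=record)

print(by_view["EDUC"].shape)            # (2,)
print(np.isnan(by_view["EDUC"]).sum())  # 1
print(by_helper["AGE"])                 # [31. 45.]
```

Both results are 1-D arrays with one record per row, so the NaN count for a column can no longer exceed the number of rows.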

You can also check the answers to several Stack Overflow questions, including "Converting a 2D numpy array to a structured array" and "How to convert regular numpy array to record array?".
