Accessing the first items in a numpy array of tuples


Question

I have a pandas dataframe with a column of tuples, each made up of two floats, e.g. (1.1, 2.2). I want to produce an array containing the first element of each tuple. I could step through each row and take the first element of each tuple, but the dataframe contains almost 4 million records and that approach is very slow. An answer by satoru on SO (stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples) suggests the following mechanism:

>>> import numpy as np
>>> arr = np.array([(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8)])
>>> arr
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8]])
>>> arr[:,0]
array([ 1.1,  3.3,  5.5,  7.7])

That works fine and would be perfect for my needs. However, the problem occurs when I try to create the numpy array from a pandas dataframe. In that case, the above solution fails with a variety of errors. For example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
>>> df
   other       point
0      0  (1.1, 2.2)
1      0  (3.3, 4.4)
2      0  (5.5, 6.6)
3      1  (7.7, 8.8)
4      1  (9.9, 0.0)
>>> arr2 = np.array(df['point'])
>>> arr2
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
>>> arr2[:,0]
IndexError: too many indices for array

Or:

>>> arr2 = np.array([df['point']])
>>> arr2
array([[[1.1, 2.2],
        [3.3, 4.4],
        [5.5, 6.6],
        [7.7, 8.8],
        [9.9, 0.0]]], dtype=object)
>>> arr2[:,0]
array([[1.1, 2.2]], dtype=object)   # Which is not what I want!

Something seems to go wrong when I transfer data from the pandas dataframe to a numpy array, but I have no idea what. Any suggestions would be gratefully received.

Recommended answer

Starting with your dataframe, I can extract a (5,2) array with:

In [68]: df=pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})

In [69]: np.array(df['point'].tolist())
Out[69]: 
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8],
       [ 9.9,  0. ]])

df['point'] is a pandas Series.

df['point'].values returns an array of shape (5,) and dtype object:

array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)

It is, in effect, an array of tuples: real tuples, not the tuple look-alikes of a structured array. The array actually holds pointers to the tuples, which live elsewhere in memory. Its shape is (5,), so it is a 1d array, and indexing it as though it were 2d gives the 'too many indices' error. np.array([df['point']]) just wraps it in another dimension without addressing the fundamental object dtype issue.
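The points above can be checked directly. This is a minimal sketch that rebuilds the dataframe from the question; the variable names `shape`, `dtype`, `first_is_tuple` and `raised` are illustrative only and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'other': [0, 0, 0, 1, 1],
                   'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)]})

arr2 = df['point'].to_numpy()                 # same data as df['point'].values
shape = arr2.shape                            # a single dimension, one slot per row
dtype = arr2.dtype                            # object: each slot refers to a Python tuple
first_is_tuple = isinstance(arr2[0], tuple)   # the elements are ordinary tuples

try:
    arr2[:, 0]                                # two indices on a 1d array
    raised = False
except IndexError:
    raised = True                             # the 'too many indices' error from the question
```

Because the array is one-dimensional, the only valid index is the row; the tuple contents are invisible to numpy indexing.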

tolist() converts the Series to a list of tuples, from which you can build the 2d array.
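Once the 2d array exists, the slicing from the question works as intended. A short sketch, again using the question's sample data; `points`, `first` and `first_alt` are names chosen here for illustration, and the list comprehension is shown only as a possible alternative, not something the answer proposes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'other': [0, 0, 0, 1, 1],
                   'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)]})

# Build a regular float array from the list of tuples, then slice column 0.
points = np.array(df['point'].tolist())
first = points[:, 0]

# Alternative that skips the intermediate 2d array: pick the first
# item of each tuple in Python, then convert the result once.
first_alt = np.array([p[0] for p in df['point']])
```

Both routes still visit every tuple once. Which one is faster on 4 million rows is not something the source measures, so it is worth timing on the real data.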

Converting an array of objects to a numeric n-d array is not trivial and invariably requires copying the data. The data buffers are entirely different (an object array stores pointers, a float array stores the values themselves), so things like astype don't work.
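To illustrate the astype remark, here is a small sketch under one assumption: the object array is filled element by element so that numpy keeps the tuples as tuples. The names `obj`, `astype_ok` and `converted` are hypothetical.

```python
import numpy as np

tuples = [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6)]

# Create a 1d object array whose elements are the tuples themselves.
obj = np.empty(len(tuples), dtype=object)
for i, t in enumerate(tuples):
    obj[i] = t

# astype(float) tries to turn each tuple into a single float, which fails.
try:
    obj.astype(float)
    astype_ok = True
except (ValueError, TypeError):
    astype_ok = False

# Going through a list of tuples builds a new float buffer of shape (3, 2).
converted = np.array(obj.tolist())
```

The conversion allocates a fresh float array, so changes to `converted` do not affect the tuples referenced by `obj`.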
