Accessing the first items in a numpy array of tuples
Question
I have a pandas dataframe that has a column that contains tuples made up of two floats e.g. (1.1,2.2). I want to be able to produce an array that contains the first element of each tuple. I could step through each row and get the first element of each tuple but the dataframe contains almost 4 million records and such an approach is very slow. An answer by satoru on SO (stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples) suggests using the following mechanism:
>>> import numpy as np
>>> arr = np.array([(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8)])
>>> arr
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8]])
>>> arr[:,0]
array([ 1.1, 3.3, 5.5, 7.7])
So that works fine and would be absolutely perfect for my needs. However, the problem I have occurs when I try to create a numpy array from a pandas dataframe. In that case, the above solution fails with a variety of errors. For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
>>> df
   other       point
0      0  (1.1, 2.2)
1      0  (3.3, 4.4)
2      0  (5.5, 6.6)
3      1  (7.7, 8.8)
4      1  (9.9, 0.0)
>>> arr2 = np.array(df['point'])
>>> arr2
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
>>> arr2[:,0]
IndexError: too many indices for array
Or:
>>> arr2 = np.array([df['point']])
>>> arr2
array([[[1.1, 2.2],
        [3.3, 4.4],
        [5.5, 6.6],
        [7.7, 8.8],
        [9.9, 0.0]]], dtype=object)
>>> arr2[:,0]
array([[1.1, 2.2]], dtype=object) # Which is not what I want!
Something seems to be going wrong when I transfer data from the pandas dataframe to a numpy array - but I've no idea what. Any suggestions would be gratefully received.
Recommended answer
Starting with your dataframe, I can extract a (5,2) array with:
In [68]: df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
In [69]: np.array(df['point'].tolist())
Out[69]:
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8],
       [ 9.9,  0. ]])
df['point'] is a pandas Series. df['point'].values returns an array of shape (5,) and dtype object:
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
It is, in effect, an array of tuples. Real tuples, not the tuple look-alikes of a structured array. The array actually contains pointers to the tuples, which live elsewhere in memory. Its shape is (5,): it is a 1d array, so trying to index it as though it were 2d gives you the 'too many indices' error. np.array([df['point']]) just wraps it in another dimension, without addressing the fundamental object dtype issue.
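To make the shape and dtype point concrete, here is a minimal sketch that rebuilds the dataframe from the question and inspects the array behind the column. The names `obj_arr` and `index_error` are illustrative only, and the exact wording of the error message may differ between numpy versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'other': [0, 0, 0, 1, 1],
                   'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)]})

# The column is backed by a 1-d object array; each element is a Python tuple.
obj_arr = df['point'].values
print(obj_arr.shape, obj_arr.dtype)
print(type(obj_arr[0]))

# 2-d style indexing fails because the array has only one dimension.
index_error = None
try:
    obj_arr[:, 0]
except IndexError as exc:
    index_error = exc
    print("IndexError:", exc)
```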
tolist() converts it to a list of tuples, from which you can build the 2d array.
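Putting that together for the original goal (the first element of each tuple), a short sketch might look like the following. The variable names `points` and `first` are my own, and I am assuming every tuple in the column has the same length, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'other': [0, 0, 0, 1, 1],
                   'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)]})

# tolist() gives a plain list of tuples; np.array turns that into a (5, 2) float array.
points = np.array(df['point'].tolist())

# Ordinary 2-d indexing now works: column 0 holds the first element of each tuple.
first = points[:, 0]
print(points.shape, points.dtype)
print(first)
```

If the tuples had differing lengths, np.array could not build a regular 2d array from them, so this approach relies on the column being uniform.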
Converting an array of objects to an n-d numeric array is not trivial, and invariably requires some sort of copying. The data buffers are entirely different (one holds pointers to Python objects, the other holds the numbers themselves), so things like astype don't work.
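A small sketch of both points, using a three-element series of my own rather than the full dataframe: astype on the object array is expected to raise, while going through a list produces an independent numeric copy. The names `astype_failed` and `converted` are illustrative, and I catch both ValueError and TypeError because the exact exception may depend on the numpy version.

```python
import numpy as np
import pandas as pd

# A 1-d object array of tuples, built the same way as df['point'].values.
obj_arr = pd.Series([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6)]).values

# astype cannot reinterpret tuple elements as individual floats.
astype_failed = False
try:
    obj_arr.astype(float)
except (ValueError, TypeError) as exc:
    astype_failed = True
    print("astype failed:", exc)

# Going through a list builds a new (3, 2) float array with its own buffer.
converted = np.array(obj_arr.tolist())
converted[0, 0] = 100.0      # modify the copy
print(obj_arr[0])            # the original tuple is unchanged
```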