Bizarre issue with pandas' .groupby function, when function applied to rows
Problem description
I have a set of CSV data that is 4203x37, which I reshape to 50436x4 in order to find the Euclidean distance between 12 sets of 3D points, recorded at each time-step. This does not work for my actual data, but bizarrely enough, when I recreated the data with random numbers it works. Code follows...
Here is the code for my actual data, the one which does not work.
df_f_2_norm = df_f.loc[:,'Time':'label37'] # Select columns
N = 12 # Nr of points
# Set the Time column aside for later use, then drop it
df_f_2_norm_time = df_f_2_norm['Time']
df_f_2_norm = df_f_2_norm.drop('Time', axis=1)
# Get shape of data frame
shp = df_f_2_norm.shape
# Use numpy.reshape to reshape the underlying data in the DataFrame
df_f_2_norm = pd.DataFrame(df_f_2_norm.values.reshape(-1,3),columns=list('XYZ'))
df_f_2_norm["Time"] = np.repeat(np.array(df_f_2_norm_time), N) # Number of points per time-label: 12
# Find the Euclidean distance (2-norm)
N_lim = int(0.5*N*(N-1))
result_index = ['D{}'.format(tag) for tag in range(1,N_lim+1)] # Column labels
two_norm = df_f_2_norm.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g), index=result_index))
Now if we look at the shape of two_norm, it should be 4203x66, i.e. 66 Euclidean distances for 12 points per time-stamp, of which there are 4203, one per row.
What actually comes out is: AssertionError: Index length did not match values. So it doesn't like the column labels that I have given it. Fine, if we remove the labels and instead just do
two_norm = df_f_2_norm.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g)))
Then the shape (from print(two_norm.shape)) is (8307846,). I do not quite understand what has happened here, but it appears that it is not even stacking all the results on top of each other.
It gets better though, because the code does work up until row 1140, so if we let
df_f_2_norm = df_f_2_norm[:1140]
Then we get the following shape: (95,66)
Which is correct up until that point, but if we do
df_f_2_norm = df_f_2_norm[:1152]
instead, it gives: (6480,)
So something has obviously gone pear-shaped there, but if we actually look at the data around that point, nothing appears too strange.
X Y Z Time
1127 -614.770 207.624 120.859 2.533
1128 791.318 291.591 64.160 2.550
1129 728.892 283.473 -207.306 2.550
1130 939.871 251.387 -145.103 2.550
1131 702.987 287.165 398.151 2.550
1132 480.309 285.745 590.925 2.550
1133 723.493 248.699 607.543 2.550
1134 255.664 183.618 -108.176 2.550
1135 -90.333 196.879 -261.102 2.550
1136 -442.132 236.314 -419.216 2.550
1137 133.428 216.805 242.896 2.550
1138 -242.201 192.100 191.588 2.550
1139 -616.844 210.060 123.202 2.550
1140 -655.054 1390.084 -359.369 1.100
1141 -726.517 1222.015 -590.799 1.100
1142 -671.655 1146.959 -797.080 1.100
1143 -762.048 1379.722 8.505 1.100
1144 -981.748 1169.959 72.773 1.100
1145 -1011.853 968.364 229.070 1.100
1146 -778.290 827.571 -370.463 1.100
1147 -761.608 460.835 -329.487 1.100
1148 -815.330 77.501 -314.721 1.100
1149 -925.764 831.944 -34.206 1.100
1150 -1009.297 475.362 -73.077 1.100
1151 -1193.310 139.839 -142.666 1.100
1152 -631.630 1388.573 -353.642 1.117
1153 -697.771 1234.274 -593.501 1.117
So that is just odd. I tried to replicate the problem with random numbers, but it all works perfectly, even the labels, which just makes no sense...
import numpy as np
import pandas as pd
import string
from scipy.spatial.distance import pdist, squareform
# Computes the distance between m points using Euclidean distance (2-norm)
# as the distance metric between the points. The points are arranged as m
# n-dimensional row vectors in the matrix X.
# Test data frame
N = 12 # Nr of points
col_ids = string.ascii_letters[:N]  # string.letters exists only in Python 2
df = pd.DataFrame(
np.random.randn(4203, 3*N+1),
columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])
# Drop time column for later use
df_time = df['Time']
df = df.drop('Time', axis=1)
print(df.shape)
# Use numpy.reshape to reshape the underlying data in the DataFrame
df = pd.DataFrame(df.values.reshape(-1,3), columns=list('XYZ'))
df["Time"] = np.repeat(np.array(df_time), N)
print(df.shape)
# Find the Euclidean distance (2-norm)
N_lim = int(0.5*N*(N-1))
result_index = ['D{}'.format(coord) for coord in range(1,N_lim+1)]
two_norm = df.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g), index=result_index))
print(two_norm.shape)
Which has the output (from the three print statements)
(4203, 36)
(50436, 4)
(4203, 66)
As you can see, the shape of the final result is exactly as it should be. But there is truly nothing different (as far as I can tell) between these two sets of data, bar the numerical differences, which should not have any impact on the actual shape of the resulting data frame.
What am I missing?
Thanks.
Original data can be found here (the one used in the first part of this post): https://www.dropbox.com/sh/80f8ue4ffa4067t/Pntl5-gUW4
It should be noted that the .csv file found in the dropbox is the data frame df_f_2_norm. Hence it is not the raw data but the re-shaped version, so the reshaping code at the start of the first listing does not need to be executed to get to this state, as it has already been performed.
Recommended answer
If you run the following code
df_f_2_norm.Time.value_counts()
then you will find that not all Time values have 12 rows.
Here is the output:
1.333 492
1.383 492
1.317 492
1.400 492
1.467 492
1.450 492
1.483 492
1.417 492
1.500 492
1.367 492
1.350 492
1.433 492
1.533 480
1.517 480
1.550 468
...
4.800 12
4.600 12
4.750 12
4.833 12
4.667 12
4.700 12
4.650 12
4.683 12
4.633 12
4.617 12
4.817 12
4.583 12
4.733 12
4.767 12
4.783 12
Length: 272, dtype: int64
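These counts are also consistent with the odd shapes in the question. A group of n points yields n(n-1)/2 pairwise distances, so 12 rows give 66 values, while a Time value that occurs in two blocks (24 rows) gives 276. If, as the drop from 2.550 to 1.100 in the question's table suggests, the clock restarted and 1.100 had already occurred within the first 1140 rows, then the first 1152 rows hold 94 groups of 12 rows and one group of 24, i.e. 94*66 + 276 = 6480 values, which matches the reported (6480,). The following is a minimal sketch on made-up data for spotting the offending Time values; the frame `demo` and the helper `n_pairs` are illustrative names, not part of the original post.

```python
import numpy as np
import pandas as pd


def n_pairs(n):
    """Number of pairwise distances among n points (the length pdist returns)."""
    return n * (n - 1) // 2


# Made-up data: three time-steps of 12 points each, where the clock
# restarts so that Time 1.100 occurs in two separate blocks.
times = np.repeat([1.100, 1.117, 1.100], 12)
demo = pd.DataFrame(np.random.randn(len(times), 3), columns=list("XYZ"))
demo["Time"] = times

# Rows per Time value; anything other than 12 breaks the (n_steps, 66) layout.
counts = demo.groupby("Time").size()
bad = counts[counts != 12]
print(bad)

# Arithmetic behind the shapes reported in the question.
print(n_pairs(12), n_pairs(24), 94 * n_pairs(12) + n_pairs(24))
```

Running the same `groupby("Time").size()` check on the real data frame would list every Time value that does not have exactly 12 rows.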
If you want to group the dataframe every 12 rows, you can:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
df_f_2_norm = pd.read_csv("astrid_data.csv")
N = 12 # Nr of points per time-step
# Positional group key: rows 0-11 -> 0, rows 12-23 -> 1, ...
# (assumes the number of rows is an exact multiple of N)
g = np.repeat(np.arange(df_f_2_norm.shape[0]//N), N)
N_lim = int(0.5*N*(N-1))
result_index = ['D{}'.format(tag) for tag in range(1,N_lim+1)] # Column labels
two_norm = df_f_2_norm.groupby(g)[["X", "Y", "Z"]].apply(lambda pts: pd.Series(pdist(pts), index=result_index))