Bizarre issue with pandas' .groupby function when a function is applied to rows

Problem description

I have a set of CSV data that is 4203x37, which I reshape to 50436x4 in order to find the Euclidean distances between 12 3D points recorded at each time step. This does not work for my actual data, but bizarrely enough it does work when I recreate the data with random numbers. Code follows.

Here is the code for my actual data, the version that does not work.

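import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df_f_2_norm = df_f.loc[:,'Time':'label37'] # Select columns
N = 12 # Nr of points

# Set the Time column aside for later use, then drop it
df_f_2_norm_time = df_f_2_norm['Time']
df_f_2_norm = df_f_2_norm.drop(columns='Time')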

# Get shape of data frame
shp = df_f_2_norm.shape

# Use numpy.reshape to reshape the underlying data in the DataFrame
df_f_2_norm = pd.DataFrame(df_f_2_norm.values.reshape(-1,3),columns=list('XYZ'))
df_f_2_norm["Time"] = np.repeat(np.array(df_f_2_norm_time), N) # Number of points per time-label: 12

# Find the Euclidean distance (2-norm)
N_lim = int(0.5*N*(N-1)) 
result_index = ['D{}'.format(tag) for tag in range(1,N_lim+1)] # Column labels
two_norm = df_f_2_norm.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g), index=result_index))

Now, if we look at the shape of two_norm, it should be 4203x66, i.e. 66 Euclidean distances for the 12 points at each of the 4203 time stamps, one time stamp per row.
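As a small aside on where the 66 comes from: 12 points form 12*11/2 = 66 unordered pairs, and pdist returns its condensed result in the same order in which itertools.combinations enumerates those pairs. The sketch below uses only the standard library; the names n_pairs, pairs and labels are illustrative and not part of the original code.

```python
from itertools import combinations
from math import comb

N = 12  # number of 3D points per time step

# Number of unordered pairs among N points: N * (N - 1) / 2
n_pairs = comb(N, 2)

# Pairs in the order pdist reports them: (1, 2), (1, 3), ..., (11, 12)
pairs = list(combinations(range(1, N + 1), 2))

# Column labels D1..D66, built the same way as result_index in the question
labels = ["D{}".format(k) for k in range(1, n_pairs + 1)]

print(n_pairs, pairs[0], pairs[-1], labels[0], labels[-1])
```

So D1 is the distance between points 1 and 2, and D66 the distance between points 11 and 12. A label list of length 66 only fits a group that contains exactly 12 rows.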

What actually comes out is: AssertionError: Index length did not match values. So it does not like the column labels I have given it. Fine, if we remove the labels and instead just do

two_norm = df_f_2_norm.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g)))

then two_norm.shape comes out as (8307846,). I do not quite understand what has happened here, but it appears that the results are not even being stacked on top of each other.

It gets better though, because the code does work up until row 1140. So if we let

df_f_2_norm = df_f_2_norm[:1140]

then we get the following shape: (95, 66)

which is correct up to that point. But if we do

df_f_2_norm = df_f_2_norm[:1152]

instead, it gives: (6480,)

So something has obviously gone pear-shaped there, but if we actually look at the data around that point, nothing appears strange.

             X         Y        Z   Time
1127  -614.770   207.624  120.859  2.533
1128   791.318   291.591   64.160  2.550
1129   728.892   283.473 -207.306  2.550
1130   939.871   251.387 -145.103  2.550
1131   702.987   287.165  398.151  2.550
1132   480.309   285.745  590.925  2.550
1133   723.493   248.699  607.543  2.550
1134   255.664   183.618 -108.176  2.550
1135   -90.333   196.879 -261.102  2.550
1136  -442.132   236.314 -419.216  2.550
1137   133.428   216.805  242.896  2.550
1138  -242.201   192.100  191.588  2.550
1139  -616.844   210.060  123.202  2.550
1140  -655.054  1390.084 -359.369  1.100
1141  -726.517  1222.015 -590.799  1.100
1142  -671.655  1146.959 -797.080  1.100
1143  -762.048  1379.722    8.505  1.100
1144  -981.748  1169.959   72.773  1.100
1145 -1011.853   968.364  229.070  1.100
1146  -778.290   827.571 -370.463  1.100
1147  -761.608   460.835 -329.487  1.100
1148  -815.330    77.501 -314.721  1.100
1149  -925.764   831.944  -34.206  1.100
1150 -1009.297   475.362  -73.077  1.100
1151 -1193.310   139.839 -142.666  1.100
1152  -631.630  1388.573 -353.642  1.117
1153  -697.771  1234.274 -593.501  1.117

So that is just odd. I tried to replicate the problem with random numbers, but then it all works perfectly, even the labels, which makes no sense to me.

import numpy as np
import pandas as pd
import string
from scipy.spatial.distance import pdist, squareform
# pdist computes the distance between m points using Euclidean distance (2-norm)
# as the distance metric between the points. The points are arranged as m
# n-dimensional row vectors in the matrix X.

# Test data frame
N = 12 # Nr of points
col_ids = string.ascii_letters[:N]
df = pd.DataFrame(
      np.random.randn(4203, 3*N+1),
      columns=['Time']+['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])

# Set the time column aside for later use, then drop it
df_time = df['Time']
df = df.drop(columns='Time')

print(df.shape)

# Use numpy.reshape to reshape the underlying data in the DataFrame
df = pd.DataFrame(df.values.reshape(-1,3), columns=list('XYZ'))
df["Time"] = np.repeat(np.array(df_time), N)

print(df.shape)

# Find the Euclidean distance (2-norm)
N_lim = int(0.5*N*(N-1))
result_index = ['D{}'.format(coord) for coord in range(1,N_lim+1)]
two_norm = df.groupby('Time')[["X", "Y", "Z"]].apply(lambda g: pd.Series(pdist(g), index=result_index))

print(two_norm.shape)

This produces the following output (from the three print statements):

(4203, 36)
(50436, 4)
(4203, 66)

As you can see, the shape of the final result is exactly as it should be. As far as I can tell, there is nothing different between these two sets of data apart from the numerical values, which should not have any impact on the shape of the resulting data frame.

What am I missing?

Thanks.

The original data (the data used in the first part of this post) can be found here: https://www.dropbox.com/sh/80f8ue4ffa4067t/Pntl5-gUW4

Note that the .csv file in the Dropbox folder is the data frame df_f_2_norm. It is therefore not the raw data but the reshaped version, so the reshaping steps at the top of the code above do not need to be run to get to this state, as they have already been performed.

Recommended answer

If you run the following code

df_f_2_norm.Time.value_counts()

then you will find that not every time value has 12 rows.

Here is the output:

1.333    492
1.383    492
1.317    492
1.400    492
1.467    492
1.450    492
1.483    492
1.417    492
1.500    492
1.367    492
1.350    492
1.433    492
1.533    480
1.517    480
1.550    468
...
4.800    12
4.600    12
4.750    12
4.833    12
4.667    12
4.700    12
4.650    12
4.683    12
4.633    12
4.617    12
4.817    12
4.583    12
4.733    12
4.767    12
4.783    12
Length: 272, dtype: int64

If you want to group the data frame in blocks of 12 consecutive rows instead, you can do this:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df_f_2_norm = pd.read_csv("astrid_data.csv")

N = 12 # Nr of points per time step

# Positional group key: 0 for the first 12 rows, 1 for the next 12, and so on
g = np.repeat(np.arange(df_f_2_norm.shape[0]//N), N)

N_lim = int(0.5*N*(N-1))
result_index = ['D{}'.format(tag) for tag in range(1,N_lim+1)] # Column labels
two_norm = df_f_2_norm.groupby(g)[["X", "Y", "Z"]].apply(lambda grp: pd.Series(pdist(grp), index=result_index))
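Grouping by position assumes that the row count is a multiple of 12 and that each block of 12 rows belongs to a single time step. The following sketch adds those two checks before building the key. The function block_key and the frame demo are hypothetical names on synthetic data, not part of the answer above.

```python
import numpy as np
import pandas as pd

def block_key(frame, n_points=12):
    """Positional group key: 0 for the first n_points rows, 1 for the next, ..."""
    n_rows = len(frame)
    if n_rows % n_points != 0:
        raise ValueError("row count is not a multiple of n_points")
    key = np.repeat(np.arange(n_rows // n_points), n_points)
    # Each block is expected to carry exactly one Time value.
    if (frame.groupby(key)["Time"].nunique() != 1).any():
        raise ValueError("a block mixes several Time values")
    return key

# Synthetic stand-in for the real data: Time 1.0 occurs in two separate blocks.
demo = pd.DataFrame(np.zeros((36, 3)), columns=list("XYZ"))
demo["Time"] = np.repeat([1.0, 1.1, 1.0], 12)

by_time = demo.groupby("Time").size()              # merges the two 1.0 blocks
by_block = demo.groupby(block_key(demo)).size()    # keeps them apart

print(by_time.tolist())
print(by_block.tolist())
```

Grouping by Time gives group sizes 24 and 12, while grouping by the positional key gives three groups of 12, which is what the fixed-length column labels require.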
