根据另一个数据框中的值对数据框行进行独立排序 [英] Sort dataframe rows independently by values in another dataframe
问题描述
假设两个数据框:
import pandas as pd
import numpy as np
d1 = {}
d2 = {}
np.random.seed(5)
for col in list("ABCDEF"):
d1[col] = np.random.randn(12)
d2[col+'2'] = np.random.random_integers(0,100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
我想按dat2中的行对dat1的行进行排序,并从dat1中提取有序数据的子集.下面是一个示例,其中从dat1中提取每行的前5个值. 例如,使用:
I want to sort dat1's rows by the rows in dat2 and extract a subset of the ordered data from dat1. Below, is an example where the top 5 values per row are extracted from dat1. For example, with:
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166 0.42985
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 30 16 60 57 77 74
我会得到:
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
这是我的解决方案.最大的问题是该代码无法处理dat1或dat2中的NA值,这是一个需要解决的巨大问题.
Here is my solution. The biggest issue is that this code does not deal with NA values either in dat1 or dat2 which is an enormous issue that needs to be fixed.
def sortByAnthr(X,Y):
return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign column names
ordr_r = ordr_cols.apply(lambda x: dat1.ix[x.name,x.values].tolist(),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
例如,对于NA,上述内容无法正确排序:
For example, with NAs, the above fails to sort correctly:
dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
推荐答案
这是我的解决方案.现在,它通过与dat1和dat2中的每行非NA值的索引相交来处理NA.但是,这会导致应用问题,即应用需要为每一行使用相同大小的输出.填充无法/未排序的项目的功能是fillVacuum.
Here is my solution. It now handles NAs by intersecting the indexes of non-NA values in dat1 and dat2 for each row. This, however, introduces an issue in apply, whereby apply needs same-sized output for each row. The function that fills items that cannot/were not sorted is fillVacuum.
def fillVacuum(toFill,MatchLengthOf):
if len(toFill)<len(MatchLengthOf):
[toFill.insert(i, np.nan) for i in range(len(MatchLengthOf)-len(toFill))]
return()
def sortByAnthr(X,Y,Xindex):
#intersect non-na column indexes between two datasets
idx = np.intersect1d(X.notnull().nonzero()[0],Y.notnull().nonzero()[0])
#order the subset of X.index by Y
ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y[idx]), key=lambda pair: pair[1])]
#due to molding that'll happen later in apply, it is necessary to fill removed indexes
fillVacuum(ordrX, Xindex)
return(ordrX)
def OrderRow(row,df):
ordrd_row = df.ix[row.dropna().name,row.dropna().values].tolist()
fillVacuum(ordrd_row, row)
return(ordrd_row)
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign interpretable column names
ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
这些函数产生以下内容:
These functions yield the following:
dat1.iloc[:2,:]
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.253954
2015-02-28 NaN NaN -0.042419 -0.232108 NaN 0.429850
dat2.iloc[:2,:]
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 NaN 16 60 57 77 NaN
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 NaN NaN NaN D C
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 NaN NaN NaN -0.232108 -0.042419
这篇关于根据另一个数据框中的值对数据框行进行独立排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!