Sort dataframe rows independently by values in another dataframe

Problem description

Suppose we have two dataframes:

import pandas as pd
import numpy as np

d1 = {}
d2 = {}

np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    # random_integers is deprecated; randint excludes the upper bound, hence 101
    d2[col+'2'] = np.random.randint(0, 101, 12)

t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")

dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)

I want to sort each row of dat1 by the corresponding row of dat2 and extract a subset of the ordered data from dat1. Below is an example in which 5 values per row are extracted from dat1: those whose dat2 entries are the largest, in ascending order of dat2. For example, with:

                   A         B         C         D         E        F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166  0.42985

            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28  30  16  60  57  77  74

I would get:

            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166
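For reference, the expected result above can be reproduced for the two sample rows with a short NumPy sketch. This is not the code from the question; the names dat1_small, dat2_small and top_n_by_other are hypothetical, the data is copied from the tables above, and it assumes there are no NA values. A stable sort is used so that ties (A2 and B2 are both 47) keep their original column order.

```python
import numpy as np
import pandas as pd

# The two sample rows shown above (hypothetical names, data copied from the tables)
idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
dat1_small = pd.DataFrame(
    [[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.253950],
     [-0.330870, -1.168279, -0.042419, -0.232108, -0.042166, 0.429850]],
    index=idx, columns=list("ABCDEF"))
dat2_small = pd.DataFrame(
    [[47, 47, 82, 66, 64, 40],
     [30, 16, 60, 57, 77, 74]],
    index=idx, columns=[c + "2" for c in "ABCDEF"])

def top_n_by_other(values_df, keys_df, n):
    """Order each row of values_df by the same row of keys_df, keep the last n."""
    # Column positions that sort each row of keys_df in ascending order
    order = np.argsort(keys_df.to_numpy(), axis=1, kind="stable")[:, -n:]
    # Column labels and values of values_df rearranged row by row
    cols = pd.DataFrame(values_df.columns.to_numpy()[order], index=values_df.index)
    vals = pd.DataFrame(
        np.take_along_axis(values_df.to_numpy(), order, axis=1),
        index=values_df.index)
    return cols, vals

cols_small, vals_small = top_n_by_other(dat1_small, dat2_small, 5)
print(cols_small)
print(vals_small)
```

Because the sorting happens on the whole array at once, this avoids a Python-level loop over rows, at the cost of not handling NA values.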

Here is my solution. Its biggest problem is that it does not handle NA values in either dat1 or dat2, and that needs to be fixed.

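def sortByAnthr(X,Y):
    # return the items of X ordered by the matching values in Y (ascending)
    return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])

def r_selectr(dat2,dat1, n):
    # column labels of dat1 ordered by dat2, row by row; keep the last n
    # (result_type='expand' turns the returned lists into columns)
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign column names

    # .ix is no longer available in pandas; .loc does the same label-based lookup here
    ordr_r = ordr_cols.apply(lambda x: dat1.loc[x.name,x.values].tolist(),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])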

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

ordr_cols.iloc[:2,:]
            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419  0.429850 -0.042166

For example, once NAs are introduced as follows, the code above fails to sort correctly:

dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
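The reason the sort breaks can be seen in isolation: every ordering comparison against NaN evaluates to False, so Python's sorted has no consistent position for a NaN key. The snippet below is a small illustration with made-up keys, not part of the original code; it also shows that np.sort, by contrast, places NaN at the end.

```python
import numpy as np

nan = float("nan")
keys = [3.0, nan, 1.0, 2.0]  # made-up sort keys, one of them missing

# Both comparisons are False, so NaN is neither smaller nor larger than 1.0
nan_is_less = nan < 1.0
nan_is_greater = nan > 1.0
print(nan_is_less, nan_is_greater)

# sorted() relies on "<", so the position of NaN (and of the values
# around it) depends on the input order rather than on the values
py_sorted = sorted(keys)
print(py_sorted)

# NumPy's sort moves NaN to the end and orders the rest normally
np_sorted = np.sort(np.array(keys))
print(np_sorted)
```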

Recommended answer

Here is my solution. It now handles NAs by intersecting, for each row, the positions of the non-NA values in dat1 and dat2. This introduces an issue in apply, however: apply needs same-sized output for each row. The function that pads the items that could not be sorted is fillVacuum.

def fillVacuum(toFill,MatchLengthOf):
    # pad toFill in place with leading NaNs until it is as long as MatchLengthOf
    for _ in range(len(MatchLengthOf)-len(toFill)):
        toFill.insert(0, np.nan)

def sortByAnthr(X,Y,Xindex):
    #intersect non-na column positions between two datasets
    #(Series.nonzero is no longer available in pandas, so use np.flatnonzero)
    idx = np.intersect1d(np.flatnonzero(X.notnull().values),np.flatnonzero(Y.notnull().values))

    #order the subset of X.index by Y (positional lookup on Y)
    ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y.iloc[idx]), key=lambda pair: pair[1])]

    #apply needs same-length output for every row, so pad the removed positions with NaN
    fillVacuum(ordrX, Xindex)

    return(ordrX)

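def OrderRow(row,df):
    #look up the values of df for this row's label in the sorted column order, skipping NaN labels
    #(.ix is no longer available in pandas; .loc does the same label-based lookup here)
    ordrd_row = df.loc[row.name,row.dropna().values].tolist()
    fillVacuum(ordrd_row, row)
    return(ordrd_row)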

def r_selectr(dat2,dat1, n):
    #result_type='expand' turns the list returned for each row into columns
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1,result_type='expand').iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) #assign interpretable column names

    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1,result_type='expand')
    return([ordr_cols, ordr_r])

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

These functions yield the following:

dat1.iloc[:2,:]
                   A         B         C         D         E         F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.253954
2015-02-28       NaN       NaN -0.042419 -0.232108       NaN  0.429850

dat2.iloc[:2,:]
            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28 NaN  16  60  57  77 NaN

ordr_cols.iloc[:2,:]
              0    1    2  3  4
2015-01-31    A    B    E  D  C
2015-02-28  NaN  NaN  NaN  D  C

ordr_r.iloc[:2,:]
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28       NaN       NaN       NaN -0.232108 -0.042419
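The same NA handling can also be sketched without apply. This is an alternative technique, not the answer's code: the names dat1_na, dat2_na and top_n_by_other_na are hypothetical, and the data is copied from the two rows printed above. Positions where either frame is NA get a key of -inf so that they sort to the front, and are then blanked out. This matches the leading-NaN padding that fillVacuum produces, and it assumes -inf never occurs as a real key.

```python
import numpy as np
import pandas as pd

nan = np.nan
idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
# The two rows printed above (hypothetical names, data copied from the output)
dat1_na = pd.DataFrame(
    [[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.253954],
     [nan, nan, -0.042419, -0.232108, nan, 0.429850]],
    index=idx, columns=list("ABCDEF"))
dat2_na = pd.DataFrame(
    [[47, 47, 82, 66, 64, 40],
     [nan, 16, 60, 57, 77, nan]],
    index=idx, columns=[c + "2" for c in "ABCDEF"])

def top_n_by_other_na(values_df, keys_df, n):
    """Row-wise ordering of values_df by keys_df, skipping NA pairs."""
    values = values_df.to_numpy(dtype=float)
    keys = keys_df.to_numpy(dtype=float)
    # A position is unusable if either frame is NA there
    invalid = np.isnan(values) | np.isnan(keys)
    # Unusable positions get the smallest possible key, so they sort first
    order = np.argsort(np.where(invalid, -np.inf, keys), axis=1, kind="stable")[:, -n:]
    picked_invalid = np.take_along_axis(invalid, order, axis=1)
    # Blank out the labels and values that came from unusable positions
    cols = np.where(picked_invalid, np.nan, values_df.columns.to_numpy()[order])
    vals = np.where(picked_invalid, np.nan, np.take_along_axis(values, order, axis=1))
    return (pd.DataFrame(cols, index=values_df.index),
            pd.DataFrame(vals, index=values_df.index))

cols_na, vals_na = top_n_by_other_na(dat1_na, dat2_na, 5)
print(cols_na)
print(vals_na)
```

For the second row only C and D are non-NA in both frames, so the result is three NaN entries followed by D and C, as in the output above.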
