How do I sort a dataframe by an array not in the dataframe


Problem description

I've answered this question several times in different contexts, and I realized that there isn't a good canonical approach specified anywhere.

So, to set up a simple problem:

import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)

   A  B
0  0  1
1  1  2
2  2  1
3  3  2
4  4  1
5  5  2

Question:

How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it in sort_values, then drop it.

df.assign(P=df.prod(axis=1)).sort_values('P').drop(columns='P')

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2


Is there a better, more concise, clearer, more consistent approach?

Recommended answer

TL;DR
iloc + argsort

We can approach this using iloc, which takes an array of ordinal positions and returns the dataframe reordered by those positions.

With the power of iloc, we can sort with any array that specifies the order.
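
As a minimal sketch of that idea (the order array below is made up for illustration and is not part of the original question), iloc accepts any array of integer positions and returns the rows in that order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))

# Hypothetical ordering: row positions listed in the order we want them back.
order = np.array([5, 3, 1, 4, 2, 0])

# iloc selects rows by position, so the result follows `order` exactly.
reordered = df.iloc[order]
print(reordered)
```

The original index labels travel with the rows, which makes it easy to see where each row came from.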

Now, all we need to do is identify a method for getting this ordering. It turns out there is a method called argsort that does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted.

Using the problem specified above

df.iloc[df.prod(1).argsort()]

Same result as above

   A  B
0  0  1
1  1  2
2  2  1
4  4  1
3  3  2
5  5  2
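
The same pattern should work when the sort key is an array that never lived in the dataframe. A small sketch, assuming a hypothetical weights array with one entry per row; kind='stable' is my addition so that any ties keep their original order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))

# Hypothetical external key, one value per row of df (not a column of df).
weights = np.array([0.3, 0.1, 0.5, 0.2, 0.6, 0.4])

# argsort returns the positions that would sort `weights` ascending.
order = np.argsort(weights, kind='stable')

# Reorder the dataframe by those positions; no temporary column is needed.
by_weights = df.iloc[order]
print(by_weights)
```

The only requirement is that the external array has the same length as the dataframe, since each position it yields must refer to an existing row.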

That was for simplicity. If performance is an issue, we can take this further and work directly with numpy:

v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
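
One detail that may matter here is tie handling: rows 1 and 2 both have a product of 2, and the default argsort algorithm does not promise to keep tied rows in their original order. A sketch, assuming a stable ordering is wanted (the kind='stable' argument is my addition, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))

v = df.values
# Row-wise products of A and B.
p = v.prod(1)

# A stable sort keeps rows with equal products in their original order.
a = p.argsort(kind='stable')

# Rebuild the dataframe from the reordered values and index labels.
result = pd.DataFrame(v[a], df.index[a], df.columns)
print(result)
```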


How fast are these solutions?

We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency, though I'd argue that it's still very clear what is going on.

backtest setup

import numpy as np
import pandas as pd
from timeit import timeit

def add_drop():
    return df.assign(P=df.prod(axis=1)).sort_values('P').drop(columns='P')

def pd_ext_sort():
    return df.iloc[df.prod(axis=1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

# dtype=float keeps the timings numeric so they can be plotted
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method'),
    dtype=float
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; use .at instead
        results.at[i, j] = timeit(stmt, setup, number=100)

results.plot()

Example 2

Suppose I have a column of negative and positive values. I want to sort by increasing magnitude. However, I want the negatives to come first.

Suppose I have the dataframe df

df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)

   A
0 -2
1 -1
2  0
3  1
4  2

I'll set up 3 versions again. This time I'll use np.lexsort, which returns the same type of array as argsort, meaning I can use it to reorder the dataframe.

Caveat: np.lexsort sorts by the last array in its list first, so the primary sort key goes last. \shrug
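
To make the key order concrete, here is a small sketch on the same values as the dataframe above (the variable names order and swapped are mine, chosen for illustration). Swapping the two keys changes which one dominates:

```python
import numpy as np

v = np.array([-2, -1, 0, 1, 2])

# Keys are listed from least to most significant:
# the last key (v >= 0) is the primary key, np.abs(v) breaks ties within it.
order = np.lexsort([np.abs(v), v >= 0])
print(v[order])    # negatives first, each group by increasing magnitude

# With the keys swapped, magnitude becomes the primary key
# and the sign only breaks ties between equal magnitudes.
swapped = np.lexsort([v >= 0, np.abs(v)])
print(v[swapped])
```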

def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(columns=['P', 'M'])

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)

All return

   A
1 -1
0 -2
2  0
3  1
4  2

How fast this time?

In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.

backtest setup

import numpy as np
import pandas as pd
from timeit import timeit

# dtype=float keeps the timings numeric so they can be plotted
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method'),
    dtype=float
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; use .at instead
        results.at[i, j] = timeit(stmt, setup, number=100)

results.plot(figsize=(15, 6))
