How do I sort a dataframe by an array not in the dataframe
Problem description
I've answered this question several times in different contexts, and I realized that no good canonical approach is specified anywhere.
So, to set up a simple problem:
import pandas as pd
df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)
A B
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
5 5 2
Question:
How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it with sort_values, then drop it.
df.assign(P=df.prod(axis=1)).sort_values('P').drop(columns='P')
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
Is there a better, more concise, clearer, more consistent approach?
Recommended answer
TL;DR
iloc + argsort
We can approach this using iloc, which takes an array of ordinal positions and returns the dataframe reordered by those positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. It turns out there is a method called argsort that does exactly this. By passing the result of argsort to iloc, we can get our dataframe sorted.
Using the problem specified above:
df.iloc[df.prod(axis=1).argsort()]
This gives the same result as above:
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
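The same pattern applies when the ordering array lives entirely outside the dataframe, which is the case the title asks about. The following is a minimal sketch, assuming a hypothetical array named key that holds one value per row; the name and its values are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))

# Hypothetical external sort key: one value per row, never stored in df
key = np.array([3.0, 0.5, 2.0, 5.0, 1.0, 4.0])

# argsort returns the positions that would sort `key`;
# iloc then reorders the rows by those positions
order = key.argsort()
sorted_df = df.iloc[order]
print(sorted_df)
```

Because iloc works on positions rather than labels, the original index labels travel with their rows and no temporary column is needed.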
That was for simplicity. If performance is an issue, we can take this further and focus on numpy:
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
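As a sanity check, the numpy route can be compared against the iloc route on the small example. This sketch assumes all columns share one dtype, since converting a mixed-dtype dataframe to a single array upcasts the values. It also passes kind='stable' to argsort (my addition, not in the original answer) so that rows with tied products keep their original relative order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))

v = df.to_numpy()
# Rows 1 and 2 both have product 2; a stable sort keeps them in input order
a = v.prod(axis=1).argsort(kind='stable')
np_sorted = pd.DataFrame(v[a], index=df.index[a], columns=df.columns)

# Same ordering applied through iloc, for comparison
pd_sorted = df.iloc[a]

print(np_sorted.index.tolist())
print((np_sorted.to_numpy() == pd_sorted.to_numpy()).all())
```

Both routes reorder rows with the same position array; the numpy version just skips the pandas indexing machinery and rebuilds the dataframe once at the end.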
How fast are these solutions?
We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
Backtest setup
import numpy as np
import pandas as pd
from timeit import timeit

def add_drop():
    return df.assign(P=df.prod(axis=1)).sort_values('P').drop(columns='P')

def pd_ext_sort():
    return df.iloc[df.prod(axis=1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at is its replacement
        results.at[i, j] = timeit(stmt, setup, number=100)
# the empty frame starts out with object dtype, so cast before plotting
results.astype(float).plot()
Example 2
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude; however, I want the negatives to come first.
Suppose I have the dataframe df:
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
A
0 -2
1 -1
2 0
3 1
4 2
I'll set up 3 versions again. This time I'll use np.lexsort, which returns the same type of array as argsort, meaning I can use it to reorder the dataframe.
Caveat: np.lexsort sorts by the last array in its list first. \shrug
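To make the key-order caveat concrete, here is a small sketch on a plain array with the same values as column A. The variable names are illustrative only. The last key passed to np.lexsort is the primary one, so the sign test goes last and the magnitude goes first.

```python
import numpy as np

v = np.array([-2, -1, 0, 1, 2])

# Keys are listed from least to most significant:
# primary key is `v >= 0` (negatives first), ties broken by magnitude
order = np.lexsort([np.abs(v), v >= 0])
print(v[order])    # negatives by increasing magnitude, then non-negatives

# Swapping the keys makes magnitude the primary key instead
swapped = np.lexsort([v >= 0, np.abs(v)])
print(v[swapped])
```

This is the reverse of sort_values(['P', 'M']), where the first column named is the primary key.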
def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(columns=['P', 'M'])

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    # v[a, None] turns the 1-D result back into a single-column 2-D array
    return pd.DataFrame(v[a, None], df.index[a], df.columns)
All three return:
A
1 -1
0 -2
2 0
3 1
4 2
How fast this time?
In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
Backtest setup
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)
for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at is its replacement
        results.at[i, j] = timeit(stmt, setup, number=100)
# cast from object dtype so the timings can be plotted
results.astype(float).plot(figsize=(15, 6))