Dataframe classification and sorting optimization problem 2
Problem description
I asked a sorting question before, and someone solved it by first using DataFrame.sort_values on both columns and then adding GroupBy.head.
Now I have a more complicated sorting problem. I need to split the dataframe by category. For each category, I take the data1 value of the row where data2 is largest, keep only the rows whose data1 is less than or equal to that value, and then sort the remaining rows by data1.
The code is as follows. How can it be optimized?
import numpy as np
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100
a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)
df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)
category data1 data2
0 A 28.194042 98.813271
1 A 26.635099 82.768130
2 A 24.345177 80.558532
3 A 24.222105 89.596726
4 B 60.883981 98.444699
5 B 49.934815 90.319787
6 B 10.751913 86.124271
7 B 4.029914 89.802120
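The two per-category blocks above differ only in the category label, so one way to cut the repetition is to move the logic into a function and apply it to each group. The following is a minimal sketch, not the asker's or answerer's code: the helper name top_below_threshold and the small hand-made frame are assumptions chosen so the result can be checked by eye.

```python
import pandas as pd

def top_below_threshold(group: pd.DataFrame, n: int = 4) -> pd.DataFrame:
    """Hypothetical helper: keep the n largest data1 values that do not
    exceed the data1 value found on the row(s) where data2 is maximal."""
    # data1 on the row(s) with the largest data2 (max() resolves ties)
    threshold = group.loc[group['data2'] == group['data2'].max(), 'data1'].max()
    kept = group[group['data1'] <= threshold]
    return kept.sort_values('data1', ascending=False).head(n)

# Small deterministic frame (illustrative data, not from the question).
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'data1':    [10.0, 50.0, 30.0, 5.0, 70.0, 20.0],
    'data2':    [1.0, 2.0, 9.0, 8.0, 3.0, 4.0],
})

# Iterate over the groups instead of repeating the block once per category.
parts = [top_below_threshold(g, n=2) for _, g in df.groupby('category')]
result = pd.concat(parts).reset_index(drop=True)
print(result)
```

Iterating over df.groupby('category') yields the groups in sorted key order by default, so the concatenated result is already ordered by category and then by descending data1.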
I tried groupby, but the code feels too complicated. Can it be optimized?
import numpy as np
import pandas as pd
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100
a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)
df2 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
df3 = df.groupby('category').apply(lambda x: x[x['data1'].isin(x[x['data1'] <= x[x['data2'] == x['data2'].max()].data1.max()]['data1'].nlargest(4))]).reset_index(drop=True)
df3 = df3.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print((df2.data1-df3.data1).max())
print((df2.data2-df3.data2).max())
0.0
0.0
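One caveat about the check above: the maximum of the differences being 0.0 only shows that no difference is positive, so a negative difference in some row would go unnoticed. A stricter comparison could use pd.testing.assert_frame_equal, sketched here on two tiny stand-in frames (the data is illustrative, and df4 is a hypothetical mismatching frame, not part of the question).

```python
import pandas as pd

# Stand-ins for the two results being compared (illustrative data).
df2 = pd.DataFrame({'category': ['A', 'B'],
                    'data1': [1.0, 2.0],
                    'data2': [3.0, 4.0]})
df3 = df2.copy()

# Passes silently when the frames match; raises AssertionError otherwise.
pd.testing.assert_frame_equal(df2, df3)

# A frame that differs in one row, with a larger data1 value.
df4 = df2.copy()
df4.loc[1, 'data1'] = 3.0

# The differences are [0.0, -1.0], so their maximum is still 0.0.
max_diff = (df2.data1 - df4.data1).max()

# The stricter check does notice the mismatch.
try:
    pd.testing.assert_frame_equal(df2, df4)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(max_diff, mismatch_detected)
```

Taking the maximum of the absolute differences, (df2.data1 - df4.data1).abs().max(), would be another way to close the same gap.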
Recommended answer
Use:
df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100
a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)
b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)
df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df1)
category data1 data2
0 A 87.560430 99.262452
1 A 85.798945 99.200321
2 A 68.614311 97.796274
3 A 41.641961 95.544980
4 B 69.937691 99.711156
5 B 56.932784 99.227111
6 B 19.903620 94.389186
7 B 12.701288 98.455274
Here, first get the data1 value of the row with the maximal data2 in each group, then filter with <=, and finally use GroupBy.head:
s = (df.sort_values('data2')
.drop_duplicates('category', keep='last')
.set_index('category')['data1'])
df = df[df['data1'] <= df['category'].map(s)]
df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
.groupby('category')
.head(4)
.reset_index(drop=True))
print(df1)
category data1 data2
0 A 87.560430 99.262452
1 A 85.798945 99.200321
2 A 68.614311 97.796274
3 A 41.641961 95.544980
4 B 69.937691 99.711156
5 B 56.932784 99.227111
6 B 12.701288 98.455274
7 B 19.903620 94.389186
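The three steps of the answer (per-group threshold, filter, top rows per group) can be traced on a small deterministic frame. This sketch swaps in GroupBy.idxmax to locate the row with the largest data2 instead of the sort_values plus drop_duplicates step used in the answer. That substitution is an assumption for illustration, and the data is made up. On ties in data2, idxmax picks the first occurrence, which may differ from the question's rule of taking the largest data1 among the tied rows.

```python
import pandas as pd

# Illustrative data, chosen so every step can be verified by hand.
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'data1':    [10.0, 50.0, 30.0, 20.0, 5.0, 70.0, 60.0],
    'data2':    [1.0, 2.0, 9.0, 3.0, 4.0, 8.0, 7.0],
})

# Step 1: per-category threshold = data1 on the row with the largest data2.
top_rows = df.loc[df.groupby('category')['data2'].idxmax()]
s = top_rows.set_index('category')['data1']

# Step 2: map each row's category to its threshold and filter with <=.
filtered = df[df['data1'] <= df['category'].map(s)]

# Step 3: sort within category by descending data1, keep the top 2 per group.
df1 = (filtered.sort_values(by=['category', 'data1'], ascending=[True, False])
       .groupby('category')
       .head(2)
       .reset_index(drop=True))
print(df1)
```

Because the threshold is mapped back onto the rows with Series.map, the filter is a single vectorized comparison and no per-group Python function is needed.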