Dataframe classification and sorting optimization problem 2


Problem description

I asked a sorting question before, and someone solved it by first calling DataFrame.sort_values on both columns and then applying GroupBy.head.

Dataframe classification and sorting optimization problem

Now I have a more complicated sorting problem. I need to split the dataframe by category. Within each category, I find the row where data2 is largest, take that row's data1 value as a cutoff, keep only the rows whose data1 is less than or equal to the cutoff, then sort by data1 in descending order and keep the top 4 rows.

The code is below. How can it be optimized?

import numpy as np
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100

a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)

b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)

df = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df)

  category      data1      data2
0        A  28.194042  98.813271
1        A  26.635099  82.768130
2        A  24.345177  80.558532
3        A  24.222105  89.596726
4        B  60.883981  98.444699
5        B  49.934815  90.319787
6        B  10.751913  86.124271
7        B   4.029914  89.802120

I tried groupby, but the code feels too complicated. Can it be optimized?

import numpy as np
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100

a = df[df['category'] == 'A']
c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)

b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)

df2 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
df3 = df.groupby('category').apply(lambda x: x[x['data1'].isin(x[x['data1'] <= x[x['data2'] == x['data2'].max()].data1.max()]['data1'].nlargest(4))]).reset_index(drop=True)
df3 = df3.sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)

print((df2.data1-df3.data1).max())
print((df2.data2-df3.data2).max())

0.0
0.0
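
For readers who find the one-line lambda above hard to follow, the same per-category rule can be written as a small named function. This is only a sketch: the helper name `top_below_threshold`, the parameter `k`, and the tiny hand-made dataframe are assumptions introduced for illustration and are not part of the original question.

```python
import pandas as pd


def top_below_threshold(group: pd.DataFrame, k: int = 4) -> pd.DataFrame:
    """Hypothetical helper: apply the question's rule to one category."""
    # Cutoff: the largest data1 among the rows that hold the group's maximal data2.
    threshold = group.loc[group["data2"] == group["data2"].max(), "data1"].max()
    # Keep rows at or below the cutoff, then take the k largest data1 values
    # (nlargest returns them in descending order).
    return group[group["data1"] <= threshold].nlargest(k, "data1")


# Small deterministic data so the result can be checked by hand.
df = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "B"],
    "data1":    [10.0, 50.0, 30.0, 20.0, 5.0, 8.0, 9.0],
    "data2":    [1.0, 2.0, 9.0, 3.0, 7.0, 4.0, 1.0],
})

# Iterate over the groups explicitly instead of using groupby.apply.
result = pd.concat(
    [top_below_threshold(g, k=2) for _, g in df.groupby("category")]
).reset_index(drop=True)
print(result)
```

In category A the largest data2 (9.0) sits on the row with data1 = 30.0, so 50.0 is filtered out and the two largest remaining values are 30.0 and 20.0. In category B the cutoff is 5.0, so only one row survives.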

Recommended answer

Use:

import numpy as np
import pandas as pd

df = pd.DataFrame()
n = 200
df['category'] = np.random.choice(('A', 'B'), n)
df['data1'] = np.random.rand(len(df))*100
df['data2'] = np.random.rand(len(df))*100

a = df[df['category'] == 'A']

c = a[a['data2'] == a.data2.max()].data1.max()
a = a[a['data1'] <= c]
a = a.sort_values(by='data1', ascending=False).head(4)

b = df[df['category'] == 'B']
c = b[b['data2'] == b.data2.max()].data1.max()
b = b[b['data1'] <= c]
b = b.sort_values(by='data1', ascending=False).head(4)

df1 = pd.concat([a, b]).sort_values(by=['category', 'data1'], ascending=[True, False]).reset_index(drop=True)
print(df1)
  category      data1      data2
0        A  87.560430  99.262452
1        A  85.798945  99.200321
2        A  68.614311  97.796274
3        A  41.641961  95.544980
4        B  69.937691  99.711156
5        B  56.932784  99.227111
6        B  19.903620  94.389186
7        B  12.701288  98.455274

Here, first get the data1 value at the maximal data2 for each group, then filter by <=, and finally use GroupBy.head:

s = (df.sort_values('data2')
       .drop_duplicates('category', keep='last')
       .set_index('category')['data1'])
df = df[df['data1'] <= df['category'].map(s)]
df1 = (df.sort_values(by=['category', 'data1'], ascending=[True, False])
         .groupby('category')
         .head(4)
         .reset_index(drop=True))
print(df1)
  category      data1      data2
0        A  87.560430  99.262452
1        A  85.798945  99.200321
2        A  68.614311  97.796274
3        A  41.641961  95.544980
4        B  69.937691  99.711156
5        B  56.932784  99.227111
6        B  19.903620  94.389186
7        B  12.701288  98.455274
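
To make the intermediate steps of this approach visible, here is a minimal sketch on a tiny hand-made dataframe (an assumption for illustration, with head(2) instead of head(4) because the data is so small). It also swaps in GroupBy.idxmax to locate the row with the largest data2 per category, as an alternative to the sort_values / drop_duplicates step above. One caveat: if several rows tie for the maximal data2, idxmax picks the first of them, whereas the question's code takes the largest data1 among the tied rows.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B", "B", "B"],
    "data1":    [10.0, 50.0, 30.0, 20.0, 5.0, 8.0, 9.0],
    "data2":    [1.0, 2.0, 9.0, 3.0, 7.0, 4.0, 1.0],
})

# Index label of the row with the largest data2 in each category.
top_rows = df.groupby("category")["data2"].idxmax()

# data1 at those rows, indexed by category: the per-category cutoff.
s = df.loc[top_rows].set_index("category")["data1"]
print(s)

# Map each row's category to its cutoff and keep rows at or below it.
kept = df[df["data1"] <= df["category"].map(s)]

# Sort within category by data1 descending, then keep the first rows per group.
df1 = (kept.sort_values(by=["category", "data1"], ascending=[True, False])
           .groupby("category")
           .head(2)
           .reset_index(drop=True))
print(df1)
```

Here `s` holds 30.0 for A and 5.0 for B, so the 50.0 row in A and the 8.0 and 9.0 rows in B are dropped before sorting.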
