用优雅的Pandas代码替换迭代 [英] Replace Iterations by elegant Pandas code

查看:67
本文介绍了用优雅的Pandas代码替换迭代的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将我的旧学校代码更改为优雅/快速的Pandas代码,例如上一个问题之一:

I am trying to change my old school codes for the elegant/fast Pandas codes, such as the one of this previous question:

滚动操作速度较慢,无法创建新列

我想使用Pandas优雅而快速的代码来提高4种不同的代码性能.

I have 4 different codes that I would like to improve in performance with an elegant and fast code using Pandas.

原始数据帧(df)与此类似(尽管更大):

The original dataframe (df) is similar to this (although much larger):

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])

它看起来像这样:

                 x
date       type
2018-01-01 A     1
           B     5
2018-02-01 B     3
2018-03-01 A     2
           B     7
           C     3
2018-04-01 A     1
           B     6
2018-05-01 B     3
2018-06-01 A     5
           B     2
           C     3
2018-07-01 A     1
2018-08-01 B     9
           C     3
2018-09-01 A     2
           B     7
2018-10-01 C     3
           A     6
           B     8
2018-11-01 A     2
2018-12-01 B     7
           C     9

我的目标是改善这种缓慢的for循环代码.在我的代码下面:

My goal is to improve this slow for-loop code. Below my code:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=np.mean(df['x'][df['date']==j])

结果如下:

         date type  x         y
0  2018-01-01    A  1  3.000000
1  2018-01-01    B  5  3.000000
2  2018-02-01    B  3  3.000000
3  2018-03-01    A  2  4.000000
4  2018-03-01    B  7  4.000000
5  2018-03-01    C  3  4.000000
6  2018-04-01    A  1  3.500000
7  2018-04-01    B  6  3.500000
8  2018-05-01    B  3  3.000000
9  2018-06-01    A  5  3.333333
10 2018-06-01    B  2  3.333333
11 2018-06-01    C  3  3.333333
12 2018-07-01    A  1  1.000000
13 2018-08-01    B  9  6.000000
14 2018-08-01    C  3  6.000000
15 2018-09-01    A  2  4.500000
16 2018-09-01    B  7  4.500000
17 2018-10-01    C  3  5.666667
18 2018-10-01    A  6  5.666667
19 2018-10-01    B  8  5.666667
20 2018-11-01    A  2  2.000000
21 2018-12-01    B  7  8.000000
22 2018-12-01    C  9  8.000000

**我尝试了下面的Pandas代码,但没有用(我仍然需要查看更多示例来弄清楚它是如何工作的):

** I tried the following Pandas code, but it doesn't work (I still need to see more examples to figure out how it works):

df['y'] = df.groupby('date')['x'].mean().reset_index(level=2, drop=True).swaplevel(0,1)

2)所有类型的日期观察(使用相同的数据框(df)):

我的目标是衡量每个日期的类型数.

2) Observations by date for all types (uses the same dataframe (df)):

My goal is to measure the number of types for every date.

我的慢速代码是:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=len(df['type'][df['date']==j])

结果如下:

         date type  x  y
0  2018-01-01    A  1  2
1  2018-01-01    B  5  2
2  2018-02-01    B  3  1
3  2018-03-01    A  2  3
4  2018-03-01    B  7  3
5  2018-03-01    C  3  3
6  2018-04-01    A  1  2
7  2018-04-01    B  6  2
8  2018-05-01    B  3  1
9  2018-06-01    A  5  3
10 2018-06-01    B  2  3
11 2018-06-01    C  3  3
12 2018-07-01    A  1  1
13 2018-08-01    B  9  2
14 2018-08-01    C  3  2
15 2018-09-01    A  2  2
16 2018-09-01    B  7  2
17 2018-10-01    C  3  3
18 2018-10-01    A  6  3
19 2018-10-01    B  8  3
20 2018-11-01    A  2  1
21 2018-12-01    B  7  2
22 2018-12-01    C  9  2

3)"A"类型的日期观察(使用相同的数据框(df)):

我的目标是测量每个日期的类型A的数量.

3) Observations by date for 'A' type (uses the same dataframe (df)):

My goal is to measure the number of type A for every date.

我的慢速代码如下:

df=df.reset_index()
df['z']=0
df['y']=0

for index,row in df.iterrows():
    if row['type']=='A':
        df['z'][index]=1
    else:
        df['z'][index]=0

for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=sum(df['z'][df['date']==j])

del df['z']

它看起来像这样:

         date type  x  y
0  2018-01-01    A  1  1
1  2018-01-01    B  5  1
2  2018-02-01    B  3  0
3  2018-03-01    A  2  1
4  2018-03-01    B  7  1
5  2018-03-01    C  3  1
6  2018-04-01    A  1  1
7  2018-04-01    B  6  1
8  2018-05-01    B  3  0
9  2018-06-01    A  5  1
10 2018-06-01    B  2  1
11 2018-06-01    C  3  1
12 2018-07-01    A  1  1
13 2018-08-01    B  9  0
14 2018-08-01    C  3  0
15 2018-09-01    A  2  1
16 2018-09-01    B  7  1
17 2018-10-01    C  3  1
18 2018-10-01    A  6  1
19 2018-10-01    B  8  1
20 2018-11-01    A  2  1
21 2018-12-01    B  7  0
22 2018-12-01    C  9  0

4)对于其z值= 1的"A"类型的观测值:

请使用以下数据框(df1):

4) Observations for 'A' type for which its z value = 1:

Please use the following dataframe (df1):

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 10, 'z': 1}, {'x': 50, 'z': 0}, {'x': 30, 'z': 0}, {'x': 20, 'z': 0}, {'x': 70, 'z': 1}, {'x': 30, 'z': 1},{'x': 10, 'z': 1}, {'x': 60, 'z': 0}, {'x': 30, 'z': 0}, {'x': 50, 'z': 1}, {'x': 20, 'z': 0}, {'x': 30, 'z': 1},{'x': 10, 'z': 0}, {'x': 90, 'z': 1}, {'x': 30, 'z': 1}, {'x': 20, 'z': 1}, {'x': 70, 'z': 0}, {'x': 30, 'z': 0}, {'x': 60, 'z': 1}, {'x': 80, 'z': 1}, {'x': 20, 'z': 0}, {'x': 70, 'z': 0}, {'x': 90, 'z': 1}]
df1 = pd.DataFrame(data, index=idx, columns=['x','z'])
df1.index.names=['date','type']
df1=df1.reset_index()
df1['date'] = pd.to_datetime(df1['date'],format = '%b-%y')
df1=df1.set_index(['date','type'])

此数据帧(df1)如下所示:

This dataframe (df1) looks like this:

                  x  z
date       type
2018-01-01 A     10  1
           B     50  0
2018-02-01 B     30  0
2018-03-01 A     20  0
           B     70  1
           C     30  1
2018-04-01 A     10  1
           B     60  0
2018-05-01 B     30  0
2018-06-01 A     50  1
           B     20  0
           C     30  1
2018-07-01 A     10  0
2018-08-01 B     90  1
           C     30  1
2018-09-01 A     20  1
           B     70  0
2018-10-01 C     30  0
           A     60  1
           B     80  1
2018-11-01 A     20  0
2018-12-01 B     70  0
           C     90  1

我的慢速代码是:

df1=df1.reset_index()
df1['h']=0
df1['k']=0
df1['y']=0

for index,row in df1.iterrows():
    if row['type']=='A':
        df1['h'][index]=1
    else:
        df1['h'][index]=0

for index,row in df1.iterrows():
    if row['z']==1 and row['h']==1:
        df1['k'][index]=1
    else:
        df1['k'][index]=0   

for j in df1['date'].unique():
    list_1=list(df1['type'][df1['date']==j].index)
    df1['y'][list_1]=sum(df1['k'][df1['date']==j])

del df1['h']
del df1['k']

它看起来像这样:

         date type   x  z  y
0  2018-01-01    A  10  1  1
1  2018-01-01    B  50  0  1
2  2018-02-01    B  30  0  0
3  2018-03-01    A  20  0  0
4  2018-03-01    B  70  1  0
5  2018-03-01    C  30  1  0
6  2018-04-01    A  10  1  1
7  2018-04-01    B  60  0  1
8  2018-05-01    B  30  0  0
9  2018-06-01    A  50  1  1
10 2018-06-01    B  20  0  1
11 2018-06-01    C  30  1  1
12 2018-07-01    A  10  0  0
13 2018-08-01    B  90  1  0
14 2018-08-01    C  30  1  0
15 2018-09-01    A  20  1  1
16 2018-09-01    B  70  0  1
17 2018-10-01    C  30  0  1
18 2018-10-01    A  60  1  1
19 2018-10-01    B  80  1  1
20 2018-11-01    A  20  0  0
21 2018-12-01    B  70  0  0
22 2018-12-01    C  90  1  0

尽管问题似乎很久,但我知道Pandas的答案可以用几行代码来解决.如果您可以让我知道您的代码比我的代码快多少,那将非常有用.

Although the question seems long I know that the answers with Pandas can be in a few lines of code. If you can let me know how much faster are your codes than mine it would be very useful.

推荐答案

您正在寻找groupby + transform. .transform是关键,因为它将结果广播回属于该组的原始DataFrame中的所有行.

You are looking for groupby + transform. Here the .transform is key as it broadcasts the result back to all rows in the original DataFrame that belong to that group.

首先,我们可以简单地在x上进行转换.

For the first we can transform on x, simply.

对于唯一类型,将其带到列中比处理索引要快,所以assign列并计算组中唯一值的数量.

For the unique types, it's faster to bring it to a column than it is to deal with the index, so assign a column and calculate the # of unique values within group.

对于最后两个条件,您可以创建一个布尔列,以查看该行是否满足条件,并在组内求和.

For the final two conditions, you can create a Boolean column to see if that row satisfies the condition, which you sum within groups.

#1 Get the mean of `x` by date
df['x_avg'] = df.groupby('date').x.transform('mean')

#2 Get the # of unique types. 
df['N'] = (df.assign(TYPE = df.index.get_level_values('type'))
             .groupby('date').TYPE.transform('nunique'))

#3 Get the number of Type == A within a group
df['num_A']  = (df.assign(eqA = (df.index.get_level_values('type') == 'A'))
                  .groupby('date').eqA.transform(sum).astype(int))

#4 Really just a slight extension of 3
df1['cond_4']  = (df1.assign(to_sum = ((df1.index.get_level_values('type') == 'A')
                                       &  (df1.z == 1)).astype(int))
                     .groupby('date').to_sum.transform(sum))

输出df:

                 x  num_A     x_avg  N
date       type                       
2018-01-01 A     1      1  3.000000  2
           B     5      1  3.000000  2
2018-02-01 B     3      0  3.000000  1
2018-03-01 A     2      1  4.000000  3
           B     7      1  4.000000  3
           C     3      1  4.000000  3
2018-04-01 A     1      1  3.500000  2
           B     6      1  3.500000  2
2018-05-01 B     3      0  3.000000  1
2018-06-01 A     5      1  3.333333  3
           B     2      1  3.333333  3
           C     3      1  3.333333  3
2018-07-01 A     1      1  1.000000  1
2018-08-01 B     9      0  6.000000  2
           C     3      0  6.000000  2
2018-09-01 A     2      1  4.500000  2
           B     7      1  4.500000  2
2018-10-01 C     3      1  5.666667  3
           A     6      1  5.666667  3
           B     8      1  5.666667  3
2018-11-01 A     2      1  2.000000  1
2018-12-01 B     7      0  8.000000  2
           C     9      0  8.000000  2

这篇关于用优雅的Pandas代码替换迭代的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆