How to do/workaround a conditional join in Python pandas?


Problem description

I am trying to calculate time-based aggregations in pandas based on date values stored in a separate table.

The top of the first table, table_a, looks like this:

    COMPANY_ID  DATE            MEASURE
    1   2010-01-01 00:00:00     10
    1   2010-01-02 00:00:00     10
    1   2010-01-03 00:00:00     10
    1   2010-01-04 00:00:00     10
    1   2010-01-05 00:00:00     10

Here is the code to create the table:

    import pandas as pd

    # Daily rows for all of 2010, one block per company
    dates = pd.date_range("01/01/2010", "12/31/2010", freq="D")
    table_a = pd.concat(
        [pd.DataFrame({'DATE': dates, 'COMPANY_ID': 1, 'MEASURE': 10}),
         pd.DataFrame({'DATE': dates, 'COMPANY_ID': 2, 'MEASURE': 10})])

The second table, table_b, looks like this:

        COMPANY     END_DATE
        1   2010-03-01 00:00:00
        1   2010-06-02 00:00:00
        2   2010-03-01 00:00:00
        2   2010-06-02 00:00:00

The code to create it is:

    table_b = pd.DataFrame({'END_DATE': pd.to_datetime(['03/01/2010', '06/02/2010', '03/01/2010', '06/02/2010']),
                            'COMPANY': (1, 1, 2, 2)})

I want to be able to get the sum of the MEASURE column for each COMPANY_ID for each 30-day period prior to the END_DATE in table_b.

This is (I think) the SQL equivalent:

    select
        b.COMPANY,
        b.END_DATE,
        sum(a.MEASURE) as MEASURE_TO_END_DATE
    from table_a a, table_b b
    where a.COMPANY_ID = b.COMPANY
      and a.DATE < b.END_DATE
      and a.DATE > b.END_DATE - 30
    group by b.COMPANY, b.END_DATE;

Thanks for your help.

Recommended answer

Well, I can think of a few ways:


  1. Essentially blow up the dataframe by merging on just the exact field (company), then filter on the 30-day windows after the merge.

     • Should be fast, but could use a lot of memory.

  2. Move the merging and the filtering on the 30-day window into a groupby().

     • Does a merge for each group, so it is slower, but it should use less memory.

Option #1

Suppose your data looks like the following (I expanded your sample data):

      print(df)
      
          company       date  measure
      0         0 2010-01-01       10
      1         0 2010-01-15       10
      2         0 2010-02-01       10
      3         0 2010-02-15       10
      4         0 2010-03-01       10
      5         0 2010-03-15       10
      6         0 2010-04-01       10
      7         1 2010-03-01        5
      8         1 2010-03-15        5
      9         1 2010-04-01        5
      10        1 2010-04-15        5
      11        1 2010-05-01        5
      12        1 2010-05-15        5
      
      print(windows)
      
         company   end_date
      0        0 2010-02-01
      1        0 2010-03-15
      2        1 2010-04-01
      3        1 2010-05-15
      

Create a beginning date for the 30-day windows:

      windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                             np.timedelta64(30,'D'))
      print(windows)
      
         company   end_date   beg_date
      0        0 2010-02-01 2010-01-02
      1        0 2010-03-15 2010-02-13
      2        1 2010-04-01 2010-03-02
      3        1 2010-05-15 2010-04-15
      

Now do a merge, then keep only the rows whose date falls between beg_date and end_date (inclusive):

      df = df.merge(windows,on='company',how='left')
      df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
      print(df)
      
          company       date  measure   end_date   beg_date
      2         0 2010-01-15       10 2010-02-01 2010-01-02
      4         0 2010-02-01       10 2010-02-01 2010-01-02
      7         0 2010-02-15       10 2010-03-15 2010-02-13
      9         0 2010-03-01       10 2010-03-15 2010-02-13
      11        0 2010-03-15       10 2010-03-15 2010-02-13
      16        1 2010-03-15        5 2010-04-01 2010-03-02
      18        1 2010-04-01        5 2010-04-01 2010-03-02
      21        1 2010-04-15        5 2010-05-15 2010-04-15
      23        1 2010-05-01        5 2010-05-15 2010-04-15
      25        1 2010-05-15        5 2010-05-15 2010-04-15
      

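You can then compute the 30-day window sums by grouping on company and end_date:

      # Select 'measure' explicitly so the datetime columns are not summed
      print(df.groupby(['company','end_date'])[['measure']].sum())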
      
                          measure
      company end_date           
      0       2010-02-01       20
              2010-03-15       30
      1       2010-04-01       10
              2010-05-15       15
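
As a rough end-to-end sketch, Option #1 can be applied directly to the question's table_a and table_b. The intermediate names (`windows`, `merged`, `in_window`, `BEG_DATE`) are illustrative, and the window bounds are inclusive as in the answer above, so each window covers 31 daily rows; the strict inequalities in the question's SQL would cover 29.

```python
import pandas as pd

# Rebuild the question's tables: daily rows for 2010, two companies.
dates = pd.date_range("2010-01-01", "2010-12-31", freq="D")
table_a = pd.concat(
    [pd.DataFrame({"DATE": dates, "COMPANY_ID": cid, "MEASURE": 10}) for cid in (1, 2)],
    ignore_index=True,
)
table_b = pd.DataFrame({
    "COMPANY": [1, 1, 2, 2],
    "END_DATE": pd.to_datetime(["2010-03-01", "2010-06-02", "2010-03-01", "2010-06-02"]),
})

# Start of each 30-day window (BEG_DATE is a name chosen for this sketch).
windows = table_b.assign(BEG_DATE=table_b["END_DATE"] - pd.Timedelta(days=30))

# Merge on company only: every row of table_a is paired with every window
# of the same company, which is what makes this approach memory hungry.
merged = table_a.merge(windows, left_on="COMPANY_ID", right_on="COMPANY", how="inner")

# Keep only the pairs where the date falls inside the window (inclusive).
in_window = merged[(merged["DATE"] >= merged["BEG_DATE"]) &
                   (merged["DATE"] <= merged["END_DATE"])]

result = (
    in_window.groupby(["COMPANY", "END_DATE"])["MEASURE"]
    .sum()
    .reset_index(name="MEASURE_TO_END_DATE")
)
print(result)
```

The intermediate `merged` frame has one row per (row of table_a, window of the same company) pair, so its size grows with the number of windows per company.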
      

Option #2

Move all the merging into a groupby. This should be better on memory, but I would expect it to be much slower:

      windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                             np.timedelta64(30,'D'))
      
      def cond_merge(g,windows):
          g = g.merge(windows,on='company',how='left')
          g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
          return g.groupby('end_date')['measure'].sum()
      
      print(df.groupby('company').apply(cond_merge, windows))
      
      company  end_date  
      0        2010-02-01    20
               2010-03-15    30
      1        2010-04-01    10
               2010-05-15    15
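
A variant of the same idea, sketched under the assumption of a recent pandas version, iterates over the groups explicitly instead of calling `apply`, so it does not depend on how `apply` treats the grouping column. The helper name `window_sums` and the per-company filtering of `windows` are choices made for this sketch, not part of the original answer.

```python
import pandas as pd

# The expanded sample data from the answer above.
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})
windows = windows.assign(beg_date=windows["end_date"] - pd.Timedelta(days=30))


def window_sums(group, group_windows):
    """Pair one company's rows with its windows, filter, and sum per window."""
    g = group.merge(group_windows, on="company", how="left")
    g = g[(g["date"] >= g["beg_date"]) & (g["date"] <= g["end_date"])]
    return g.groupby("end_date")["measure"].sum()


# One small merge per company; only one company's pairs exist at a time.
pieces = {
    company: window_sums(group, windows[windows["company"] == company])
    for company, group in df.groupby("company")
}
result = pd.concat(pieces, names=["company"])
print(result)
```

Passing only the matching company's windows keeps each per-group merge as small as possible, which is the memory saving this option is after.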
      

Another option

If your windows never overlap (as in the example data), you could do something like the following as an alternative that doesn't blow up the dataframe but is still pretty fast:

      windows['date'] = windows['end_date']
      
      df = df.merge(windows,on=['company','date'],how='outer')
      print(df)
      
          company       date  measure   end_date
      0         0 2010-01-01       10        NaT
      1         0 2010-01-15       10        NaT
      2         0 2010-02-01       10 2010-02-01
      3         0 2010-02-15       10        NaT
      4         0 2010-03-01       10        NaT
      5         0 2010-03-15       10 2010-03-15
      6         0 2010-04-01       10        NaT
      7         1 2010-03-01        5        NaT
      8         1 2010-03-15        5        NaT
      9         1 2010-04-01        5 2010-04-01
      10        1 2010-04-15        5        NaT
      11        1 2010-05-01        5        NaT
      12        1 2010-05-15        5 2010-05-15
      

This merge essentially inserts your window end dates into the dataframe. Backfilling the end dates (by group) then gives you a structure from which you can easily create your summation windows:

      # GroupBy.bfill keeps the original row index, so the result aligns with df
      df['end_date'] = df.groupby('company')['end_date'].bfill()
      
      print(df)
      
          company       date  measure   end_date
      0         0 2010-01-01       10 2010-02-01
      1         0 2010-01-15       10 2010-02-01
      2         0 2010-02-01       10 2010-02-01
      3         0 2010-02-15       10 2010-03-15
      4         0 2010-03-01       10 2010-03-15
      5         0 2010-03-15       10 2010-03-15
      6         0 2010-04-01       10        NaT
      7         1 2010-03-01        5 2010-04-01
      8         1 2010-03-15        5 2010-04-01
      9         1 2010-04-01        5 2010-04-01
      10        1 2010-04-15        5 2010-05-15
      11        1 2010-05-01        5 2010-05-15
      12        1 2010-05-15        5 2010-05-15
      
      # .copy() so that adding a column does not write into a slice of the old frame
      df = df[df.end_date.notnull()].copy()
      df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                         np.timedelta64(30,'D'))
      
      print(df)
      
         company       date  measure   end_date   beg_date
      0         0 2010-01-01       10 2010-02-01 2010-01-02
      1         0 2010-01-15       10 2010-02-01 2010-01-02
      2         0 2010-02-01       10 2010-02-01 2010-01-02
      3         0 2010-02-15       10 2010-03-15 2010-02-13
      4         0 2010-03-01       10 2010-03-15 2010-02-13
      5         0 2010-03-15       10 2010-03-15 2010-02-13
      7         1 2010-03-01        5 2010-04-01 2010-03-02
      8         1 2010-03-15        5 2010-04-01 2010-03-02
      9         1 2010-04-01        5 2010-04-01 2010-03-02
      10        1 2010-04-15        5 2010-05-15 2010-04-15
      11        1 2010-05-01        5 2010-05-15 2010-04-15
      12        1 2010-05-15        5 2010-05-15 2010-04-15
      
      df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
      # Select 'measure' explicitly so the datetime columns are not summed
      print(df.groupby(['company','end_date'])[['measure']].sum())
      
                          measure
      company end_date           
      0       2010-02-01       20
              2010-03-15       30
      1       2010-04-01       10
              2010-05-15       15
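
For the same non-overlapping case, a different technique from the merge-and-backfill shown above is `pd.merge_asof`, which can assign each row to the next window end within its company. This is only a sketch: it assumes both frames can be sorted by their date keys, and it uses `tolerance` to enforce the 30-day limit instead of computing a beg_date column.

```python
import pandas as pd

# The expanded sample data from the answer above.
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})

# merge_asof requires both inputs to be sorted by their "on" keys.
assigned = pd.merge_asof(
    df.sort_values("date"),
    windows.sort_values("end_date"),
    left_on="date",
    right_on="end_date",
    by="company",                      # only match within the same company
    direction="forward",               # nearest end_date at or after the row's date
    tolerance=pd.Timedelta(days=30),   # ignore end dates more than 30 days ahead
)

# Rows without a window in range have end_date == NaT and are dropped.
result = (
    assigned.dropna(subset=["end_date"])
    .groupby(["company", "end_date"])["measure"]
    .sum()
)
print(result)
```

Like the backfill approach, this gives each row at most one window, so it is only suitable when windows within a company do not overlap.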
      

Another alternative is to resample your first dataframe to daily data, compute rolling sums with a 30-day window, and then select the dates at the end that you are interested in. This could be quite memory intensive too.
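
A minimal sketch of that resample-and-roll idea, using the expanded sample data from the answer. The explicit per-company loop and the 31-row window are choices made here: 31 rows are used so that the result matches the inclusive beg_date/end_date bounds of the earlier options.

```python
import pandas as pd

# The expanded sample data from the answer above.
df = pd.DataFrame({
    "company": [0] * 7 + [1] * 6,
    "date": pd.to_datetime([
        "2010-01-01", "2010-01-15", "2010-02-01", "2010-02-15",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-03-01", "2010-03-15", "2010-04-01",
        "2010-04-15", "2010-05-01", "2010-05-15",
    ]),
    "measure": [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    "company": [0, 0, 1, 1],
    "end_date": pd.to_datetime(["2010-02-01", "2010-03-15", "2010-04-01", "2010-05-15"]),
})

pieces = {}
for company, group in df.groupby("company"):
    # One row per calendar day; days without data contribute 0 to the sum.
    daily = group.set_index("date")["measure"].resample("D").sum()
    # Trailing sum over the current day and the 30 days before it.
    rolled = daily.rolling(window=31, min_periods=1).sum()
    # Pick out only the window end dates for this company.
    ends = pd.DatetimeIndex(
        windows.loc[windows["company"] == company, "end_date"], name="end_date"
    )
    # An end date outside the resampled range would come back as NaN here.
    pieces[company] = rolled.reindex(ends)

result = pd.concat(pieces, names=["company"])
print(result)
```

The daily series holds a value for every calendar day in each company's date range, which is where the extra memory goes when the original data is sparse.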
