How can DataFrames be merged such that the values of one that correspond to *dates* get applied to all *times* of all dates of the other?


Question

I've got two DataFrames. One has a set of values corresponding to certain times and dates (df_1). The other has a set of values corresponding to certain dates (df_2). I want to merge these DataFrames such that the values of df_2 for dates get applied to all times of df_1 for the corresponding dates.

So, here is df_1:

|DatetimeIndex          |value_1|
|-----------------------|-------|
|2015-07-18 13:53:33.280|10     |
|2015-07-18 15:43:30.111|11     |
|2015-07-19 13:54:03.330|12     |
|2015-07-20 13:52:13.350|13     |
|2015-07-20 16:10:01.901|14     |
|2015-07-20 16:50:55.020|15     |
|2015-07-21 13:56:03.126|16     |
|2015-07-22 13:53:51.747|17     |
|2015-07-22 19:45:14.647|18     |
|2015-07-23 13:53:29.346|19     |
|2015-07-23 20:00:30.100|20     |

Here is df_2:

|DatetimeIndex|value_2|
|-------------|-------|
|2015-07-18   |100    |
|2015-07-19   |200    |
|2015-07-20   |300    |
|2015-07-21   |400    |
|2015-07-22   |500    |
|2015-07-23   |600    |

I would like to merge them like this:

|DatetimeIndex          |value_1|value_2|
|-----------------------|-------|-------|
|2015-07-18 00:00:00.000|NaN    |100    |
|2015-07-18 13:53:33.280|10.0   |100    |
|2015-07-18 15:43:30.111|11.0   |100    |
|2015-07-19 00:00:00.000|NaN    |200    |
|2015-07-19 13:54:03.330|12.0   |200    |
|2015-07-20 00:00:00.000|NaN    |300    |
|2015-07-20 13:52:13.350|13.0   |300    |
|2015-07-20 16:10:01.901|14.0   |300    |
|2015-07-20 16:50:55.020|15.0   |300    |
|2015-07-21 00:00:00.000|NaN    |400    |
|2015-07-21 13:56:03.126|16.0   |400    |
|2015-07-22 00:00:00.000|NaN    |500    |
|2015-07-22 13:53:51.747|17.0   |500    |
|2015-07-22 19:45:14.647|18.0   |500    |
|2015-07-23 00:00:00.000|NaN    |600    |
|2015-07-23 13:53:29.346|19.0   |600    |
|2015-07-23 20:00:30.100|20.0   |600    |

So value_2 is present for every row.

What kind of merge is this called? How can it be done?

Code for the DataFrames is as follows:

import pandas as pd

df_1 = pd.DataFrame(
    [
        [pd.Timestamp("2015-07-18 13:53:33.280"), 10],
        [pd.Timestamp("2015-07-18 15:43:30.111"), 11],
        [pd.Timestamp("2015-07-19 13:54:03.330"), 12],
        [pd.Timestamp("2015-07-20 13:52:13.350"), 13],
        [pd.Timestamp("2015-07-20 16:10:01.901"), 14],
        [pd.Timestamp("2015-07-20 16:50:55.020"), 15],
        [pd.Timestamp("2015-07-21 13:56:03.126"), 16],
        [pd.Timestamp("2015-07-22 13:53:51.747"), 17],
        [pd.Timestamp("2015-07-22 19:45:14.647"), 18],
        [pd.Timestamp("2015-07-23 13:53:29.346"), 19],
        [pd.Timestamp("2015-07-23 20:00:30.100"), 20]
    ],
    columns = [
        "datetime",
        "value_1"
    ]
)
df_1.index = df_1["datetime"]
del df_1["datetime"]
df_1.index = pd.to_datetime(df_1.index.values)

df_2 = pd.DataFrame(
    [
        [pd.Timestamp("2015-07-18 00:00:00"), 100],
        [pd.Timestamp("2015-07-19 00:00:00"), 200],
        [pd.Timestamp("2015-07-20 00:00:00"), 300],
        [pd.Timestamp("2015-07-21 00:00:00"), 400],
        [pd.Timestamp("2015-07-22 00:00:00"), 500],
        [pd.Timestamp("2015-07-23 00:00:00"), 600]
    ],
    columns = [
        "datetime",
        "value_2"
    ]
)
df_2.index = df_2["datetime"]
del df_2["datetime"]
df_2.index = pd.to_datetime(df_2.index.values)

Recommended answer

Solution
Construct a new index that is the union of the two. Then use a combination of reindex and map:

idx = df_1.index.union(df_2.index)

df_1.reindex(idx).assign(value_2=idx.floor('D').map(df_2.value_2.get))

                         value_1  value_2
2015-07-18 00:00:00.000      NaN      100
2015-07-18 13:53:33.280     10.0      100
2015-07-18 15:43:30.111     11.0      100
2015-07-19 00:00:00.000      NaN      200
2015-07-19 13:54:03.330     12.0      200
2015-07-20 00:00:00.000      NaN      300
2015-07-20 13:52:13.350     13.0      300
2015-07-20 16:10:01.901     14.0      300
2015-07-20 16:50:55.020     15.0      300
2015-07-21 00:00:00.000      NaN      400
2015-07-21 13:56:03.126     16.0      400
2015-07-22 00:00:00.000      NaN      500
2015-07-22 13:53:51.747     17.0      500
2015-07-22 19:45:14.647     18.0      500
2015-07-23 00:00:00.000      NaN      600
2015-07-23 13:53:29.346     19.0      600
2015-07-23 20:00:30.100     20.0      600


Explanation

  • Taking the union of the two should be self-explanatory. However, when taking the union, we automatically get a sorted index as well. That's convenient!
  • When we reindex df_1 with this new and improved union of indices, some of the index values will not be present in the index of df_1. Without specifying other parameters, the column values for those previously non-existent indices will be np.nan, which is what we were going for.
  • I use assign to add columns.
    • I think it's cleaner
    • It doesn't overwrite the dataframe I'm working with
    • It pipelines well
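
To make the three steps above concrete, here is a minimal sketch that runs them one at a time on a cut-down version of the question's data (three rows of df_1 and two rows of df_2). The intermediate names `expanded` and `days` are only illustrative and do not appear in the answer itself.

```python
import pandas as pd

# Cut-down versions of the frames from the question.
df_1 = pd.DataFrame(
    {"value_1": [10, 11, 12]},
    index=pd.to_datetime([
        "2015-07-18 13:53:33.280",
        "2015-07-18 15:43:30.111",
        "2015-07-19 13:54:03.330",
    ]),
)
df_2 = pd.DataFrame(
    {"value_2": [100, 200]},
    index=pd.to_datetime(["2015-07-18", "2015-07-19"]),
)

# Step 1: the union holds every timestamp from both frames, in sorted order.
idx = df_1.index.union(df_2.index)

# Step 2: reindexing adds the midnight rows; value_1 is NaN on those rows.
expanded = df_1.reindex(idx)

# Step 3: floor each timestamp to its day, then look that day up in df_2.
days = idx.floor("D")
result = expanded.assign(value_2=days.map(df_2["value_2"].get))

print(result)
```

Because `assign` returns a new frame, df_1 is left untouched and the whole thing can be chained into a single expression, as in the one-liner above.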

Response to Comment
Suppose df_2 has several columns. We could use join instead:

      df_1.join(df_2.loc[idx.date].set_index(idx), how='outer')
      
                               value_1  value_2
      2015-07-18 00:00:00.000      NaN      100
      2015-07-18 13:53:33.280     10.0      100
      2015-07-18 15:43:30.111     11.0      100
      2015-07-19 00:00:00.000      NaN      200
      2015-07-19 13:54:03.330     12.0      200
      2015-07-20 00:00:00.000      NaN      300
      2015-07-20 13:52:13.350     13.0      300
      2015-07-20 16:10:01.901     14.0      300
      2015-07-20 16:50:55.020     15.0      300
      2015-07-21 00:00:00.000      NaN      400
      2015-07-21 13:56:03.126     16.0      400
      2015-07-22 00:00:00.000      NaN      500
      2015-07-22 13:53:51.747     17.0      500
      2015-07-22 19:45:14.647     18.0      500
      2015-07-23 00:00:00.000      NaN      600
      2015-07-23 13:53:29.346     19.0      600
      2015-07-23 20:00:30.100     20.0      600
      

This may seem like a better answer in that it is shorter. But it is slower for the single-column case. By all means, use it for the multi-column case.

      %timeit df_1.reindex(idx).assign(value_2=idx.floor('D').map(df_2.value_2.get))
      %timeit df_1.join(df_2.loc[idx.date].set_index(idx), how='outer')
      
      1.56 ms ± 69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      2.38 ms ± 591 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
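
As a rough illustration of the multi-column case, the sketch below adds a hypothetical second column, `value_3`, to df_2 (it is not part of the question's data). It also swaps `idx.date` for `idx.normalize()`: `idx.date` yields plain `datetime.date` objects, and whether those are accepted as labels on a DatetimeIndex may depend on the pandas version, whereas `normalize()` keeps everything as timestamps at midnight. Treat this as one possible variant under those assumptions, not as the answer's exact code.

```python
import pandas as pd

df_1 = pd.DataFrame(
    {"value_1": [10, 11, 12]},
    index=pd.to_datetime([
        "2015-07-18 13:53:33.280",
        "2015-07-18 15:43:30.111",
        "2015-07-19 13:54:03.330",
    ]),
)
# value_3 is a made-up extra column, used only to show the multi-column case.
df_2 = pd.DataFrame(
    {"value_2": [100, 200], "value_3": [1.5, 2.5]},
    index=pd.to_datetime(["2015-07-18", "2015-07-19"]),
)

idx = df_1.index.union(df_2.index)

# Select one df_2 row per timestamp in idx (keyed by that timestamp's day),
# then relabel those rows with the full timestamps.
per_row = df_2.loc[idx.normalize()].set_index(idx)

# The outer join keeps the midnight rows that exist only in per_row.
result = df_1.join(per_row, how="outer")

print(result)
```

Every column of df_2 is carried along in a single step, which is why join suits the multi-column case better than one `map` call per column.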
      
