Pandas: Slice DataFrame by Datetime (that may not exist) and Return View

Question

I have a large DataFrame that I would like to slice, perform some calculations on the slice, and have the resulting values updated in the original. In addition, I am slicing the DataFrame by a start and end time that may not exist in the index. Below is a simplified example; in practice I will want to update a number of columns based on different calculations.

In [1]: df
Out[1]:

                         A        B         C
TIME
2014-01-02 14:00:00 -1.172285  1.706200    NaN
2014-01-02 14:05:00  0.039511 -0.320798    NaN
2014-01-02 14:10:00 -0.192179 -0.539397    NaN
2014-01-02 14:15:00 -0.475917 -0.280055    NaN
2014-01-02 14:20:00  0.163376  1.124602    NaN
2014-01-02 14:25:00 -2.477812  0.656750    NaN

I have tried all of the statements below to create sdf as a view of my time range:

from datetime import datetime
start = datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')

sdf = df[start:end]
sdf = df[start < df.index < end]
sdf = df.ix[start:end]
sdf = df.loc[start:end]
sdf = df.truncate(before=start, after=end, copy=False)

sdf['C'] = 100

Most of these return a copy, and I get a SettingWithCopyWarning. The loc call reports that the index is incompatible with datetime. Is this something I should be able to do? The result I would like after updating the slice is:

In [1]: df
Out[1]:

                         A        B         C
TIME
2014-01-02 14:00:00 -1.172285  1.706200    NaN
2014-01-02 14:05:00  0.039511 -0.320798    NaN
2014-01-02 14:10:00 -0.192179 -0.539397    100
2014-01-02 14:15:00 -0.475917 -0.280055    100
2014-01-02 14:20:00  0.163376  1.124602    100
2014-01-02 14:25:00 -2.477812  0.656750    NaN

Can anyone suggest a way to do this? Am I approaching this the wrong way?

Thanks

Recommended answer

One way is to use loc with a boolean mask: wrap each condition in parentheses and combine them with the bitwise operator &. The bitwise operator is required because you are comparing an array of values rather than a single value, and the parentheses are required because & binds more tightly than the comparison operators. The mask then selects the rows by label in loc, and the 'C' column can be set like so:

In [15]:

import datetime as dt
start = dt.datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = dt.datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')
df.loc[(df.index > start) & (df.index < end), 'C'] = 100
df
Out[15]:
                            A         B    C
TIME                                        
2014-01-02 14:00:00 -1.172285  1.706200  NaN
2014-01-02 14:05:00  0.039511 -0.320798  NaN
2014-01-02 14:10:00 -0.192179 -0.539397  100
2014-01-02 14:15:00 -0.475917 -0.280055  100
2014-01-02 14:20:00  0.163376  1.124602  100
2014-01-02 14:25:00 -2.477812  0.656750  NaN
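
The session above assumes df already exists. A self-contained sketch that rebuilds the sample frame from the question and applies the same mask could look like this; the construction via pd.date_range and the variable name mask are assumptions made for illustration, not part of the original answer:

```python
import datetime as dt

import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (values copied from the post).
df = pd.DataFrame(
    {
        "A": [-1.172285, 0.039511, -0.192179, -0.475917, 0.163376, -2.477812],
        "B": [1.706200, -0.320798, -0.539397, -0.280055, 1.124602, 0.656750],
        "C": np.nan,
    },
    index=pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME"),
)

# Neither bound is present in the index.
start = dt.datetime.strptime("2014-01-02 14:07:00", "%Y-%m-%d %H:%M:%S")
end = dt.datetime.strptime("2014-01-02 14:22:00", "%Y-%m-%d %H:%M:%S")

# Element-wise comparisons give boolean arrays; & combines them row by row.
mask = (df.index > start) & (df.index < end)

# Assigning through loc on the original frame writes in place,
# so no intermediate slice (and no SettingWithCopyWarning) is involved.
df.loc[mask, "C"] = 100
```

Because the assignment goes through df.loc directly, there is no separate sliced object whose view-or-copy status matters.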

If we look at each method you tried and why it didn't work:

sdf = df[start:end]  # raises KeyError if start and end are not present in the index
sdf = df[start < df.index < end]  # raises ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(); the chained comparison needs a single boolean, but you are comparing arrays of values
sdf = df.ix[start:end]  # raises KeyError, same as the first example (note that .ix was removed in pandas 1.0)
sdf = df.loc[start:end]  # raises KeyError, same as the first example
sdf = df.truncate(before=start, after=end, copy=False)  # generates the correct result, but assigning to it raises SettingWithCopyWarning, as you've found
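
Two of these behaviours can be checked directly. The sketch below assumes a reasonably recent pandas version and a sorted DatetimeIndex; under those assumptions, label slicing through loc accepts bounds that are absent from the index, which differs from the KeyError reported above and may reflect the older pandas version in use when the answer was written. The frame and the name chained_error are illustrative only:

```python
import datetime as dt

import numpy as np
import pandas as pd

# Minimal stand-in for the question's frame: only column C, sorted 5-minute index.
df = pd.DataFrame(
    {"C": np.nan},
    index=pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME"),
)
start = dt.datetime(2014, 1, 2, 14, 7)
end = dt.datetime(2014, 1, 2, 14, 22)

# A chained comparison is evaluated as `(start < idx) and (idx < end)`.
# `and` asks the first boolean array for a single truth value, which fails.
try:
    df[start < df.index < end]
    chained_error = None
except ValueError as exc:
    chained_error = exc

# On a sorted DatetimeIndex, loc slices by position in the ordering, so the
# bounds do not have to be labels that actually occur in the index.
df.loc[start:end, "C"] = 100
```

If the index is not sorted, this label-slicing form should not be relied on; the boolean mask works regardless of ordering.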

Edit

You can set sdf to the mask and use it with loc to set your 'C' column:

In [7]:

import datetime as dt
start = dt.datetime.strptime('2014-01-02 14:07:00', '%Y-%m-%d %H:%M:%S')
end = dt.datetime.strptime('2014-01-02 14:22:00', '%Y-%m-%d %H:%M:%S')
sdf = (df.index > start) & (df.index < end)
df.loc[sdf,'C'] = 100
df
Out[7]:
                            A         B    C
TIME                                        
2014-01-02 14:00:00 -1.172285  1.706200  NaN
2014-01-02 14:05:00  0.039511 -0.320798  NaN
2014-01-02 14:10:00 -0.192179 -0.539397  100
2014-01-02 14:15:00 -0.475917 -0.280055  100
2014-01-02 14:20:00  0.163376  1.124602  100
2014-01-02 14:25:00 -2.477812  0.656750  NaN
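
Since the question mentions updating several columns with different calculations, the same mask can be reused for each assignment. The following is a sketch only: the two calculations (a sum of A and B, and an absolute value) are hypothetical placeholders, and the frame is rebuilt here so the example runs on its own:

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [-1.172285, 0.039511, -0.192179, -0.475917, 0.163376, -2.477812],
        "B": [1.706200, -0.320798, -0.539397, -0.280055, 1.124602, 0.656750],
        "C": np.nan,
    },
    index=pd.date_range("2014-01-02 14:00:00", periods=6, freq="5min", name="TIME"),
)
start = dt.datetime(2014, 1, 2, 14, 7)
end = dt.datetime(2014, 1, 2, 14, 22)

# Build the mask once and reuse it for every column that needs updating.
sdf = (df.index > start) & (df.index < end)

# Hypothetical calculation 1: C becomes A + B inside the window.
df.loc[sdf, "C"] = df.loc[sdf, "A"] + df.loc[sdf, "B"]

# Hypothetical calculation 2: B is replaced by its absolute value inside the window.
df.loc[sdf, "B"] = df.loc[sdf, "B"].abs()
```

Reading with df.loc[sdf, ...] gives the windowed rows for the calculation, and writing with df.loc[sdf, ...] puts the result back into the original frame, leaving rows outside the window untouched.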
