pandas仅使用datetime索引替换列的一部分 [英] pandas replace only part of a column with datetime index

查看:117
本文介绍了pandas仅使用datetime索引替换列的一部分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个跟进问题:
pandas replace只是列的一部分

这是我当前的输入:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date

pd.set_option('expand_frame_repr', False)

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))

# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......

print (df.round(2))

这是我当前的输出:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0 
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  NaN
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  NaN
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  NaN
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
  File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'

我想要做的是将列'x'更改为连续三个1的集合,不重叠。所需的输出是:

What I want to do is to change column 'x' to be a set of three 1's in a row, non-overlapping. The desired output is:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN

因此,1月5日,18日和31日更改从NaN到1.0。

So, January 5, 18 and 31 change from NaN to 1.0.

正如上面的评论所述,代码的第二部分适用于基于整数的索引。但是,当日期时间索引为dtype datetime64 [ns]时,它不起作用。我想我只需对代码的第二部分进行微调就可以使其工作(希望如此)。

As the comment above states, the second portion of the code works perfect for an integer based index. However it doesn't work when with the datetime index of dtype datetime64[ns]. I think I need just a tiny tweak to the second portion of the code to get this to work (hopefully).

提前致谢,
David

Thanks in advance, David

--------------------------跟进部分-------- ----------------------------

--------------------------follow up section ------------------------------------

感谢您和我一起待在那里B2002。由于它的简洁,我真的想继续使用顶级解决方案。当我开箱即用你的代码时,这是输出:

Thanks for hanging in there with me b2002. i am really trying to keep with the top solution due to it's brevity. when i run your code out of the box, here is the output:

原始输出

.. .jj = [ii [i] for i in range(1,len(ii))如果dd [i-1]> 2] ...

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...

.. 。[ci:ci + 2] = 1.0 ...

... a[ci:ci+2] = 1.0...

              Open    High     Low   Close    Volume  Adj Close    x  ii  dd  jj  jj  desired
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0  1
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0  1
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0  2          x    x
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  1.0  3   1  
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN  4   1
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN  5   1
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN  6   1
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN  7   1
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0  1
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0  1
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0  10  3   x  x    x
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  1.0  11  1
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0  1
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0  1
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0  1
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  1.0  15  4   z  z
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  1.0  16  1
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0  1
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0  1
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0  19  3   x  x    x
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  1.0  20  1              

我真的想了解发生了什么,所以我设置了列ii,dd,jj之前,jj之后和期望。当我将输入调整为:

I am really trying to understand what is going on so i set up columns ii, dd, jj before, jj after, and desired. when i tweak the input to:

... jj = [ii [i] for i in range(1,len(ii))dd [i-1] ]> 2] ...

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...

... a [ci:ci + 1] = 1.0 ...

... a[ci:ci+1] = 1.0...

这是输出:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  NaN

唯一的问题是1月25日,其中np .diff给出的值为4.我只需要将代码跳过值4就可以单独保留现有的三个1。我尝试修改dd,然后进入jj,这两次尝试无效:

the only problem is with January 25th, where the np.diff is giving a value of 4. i just need the code to skip a value of 4 to leave the existing sets of three 1's alone. i tried to modify dd before it goes to jj with these two attempts which didn't work:

dd[dd == 4] = 1

dd = [3 if x==4 else x for x in dd]

也尝试用这样修改jj条目:

also tried to modify the jj entry with this:

jj = [ii [i] for i in range(1,len(ii))如果((dd == 4)或(dd [i-1]> 2))]

jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]

会出现此错误消息:

Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
  File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is     ambiguous. Use a.any() or a.all()

任何人都有任何想法?

推荐答案

---------------------最终答案/最终解决--- --------
嗯,这是几个星期的兼职工作和几十个小时,但我终于明白了!我知道这个代码是一个钝器,但它的工作原理。如果有人有减少代码或加快速度的建议,请告诉我!

--------------------- FINAL ANSWER / FINALLY SOLVED ----------- well, it's been a couple weeks of part time effort and several dozen hours but i finally got it! i know this code is a blunt instrument but it works. if anyone has suggestions on reducing the code or speeding it up, please let me know!

这里是最终输入:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date 

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']

test = 0

for i in np.nditer(df['x2'], op_flags=['readwrite']):

    if test == 4:
        test = 0

    if test == 3:
        i[...] = 3
        test = 4

    if test == 2:
        i[...] = 2
        test = 3

    if (test == 1) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == np.nan):
        i[...] = np.nan
        test = 1

print (df.round(2))

这是最终输出的一部分:

here is a section the final output:

              Open    High     Low   Close    Volume  Adj Close    x   x2
Date                                                                     
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  NaN  NaN
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  NaN  NaN
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  1.0  2.0
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  1.0  3.0
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  1.0  1.0
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  1.0  2.0
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  1.0  3.0
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  NaN  NaN
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  NaN  NaN
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  1.0  2.0
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  NaN  3.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  NaN  NaN
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  NaN  NaN
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN  2.0
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  NaN  3.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  NaN  NaN
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  1.0  2.0
2000-02-02  137.12  137.62  134.06  134.06  21820200      26.32  1.0  3.0
2000-02-03  135.94  139.81  135.25  139.25  20232000      27.34  1.0  1.0
2000-02-04  141.00  143.12  140.50  141.56  18167100      27.79  NaN  2.0
2000-02-07  141.69  141.75  135.88  136.50  18285000      26.80  NaN  3.0

i将列x2中的值更改为显示1 - 3而不是仅显示1以查看新系列在旧系列结束时的开始时间。

i altered the values in column x2 to display 1 - 3 instead of just 1 to see when a new series started at the end of an old series.

这篇关于pandas仅使用datetime索引替换列的一部分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆