Pandas custom function to find whether it is the 1st, 2nd, etc. Monday, Tuesday, etc. - all suggestions welcome


Question

So I have the following code, which reads in five columns: a date followed by OHLC values. It then creates a column 'dow' to hold the day of the week. So far so good:

import numpy as np
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Forex/EURUSD-2018_12_18-2020_11_01.csv',parse_dates=True,names = ['date','1','2','3','4',])

df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
df['dow'] = df['date'].dt.dayofweek
#df['downum'] = df.apply(lambda x: downu(x['date']))
df

This produces the following output:

                    date                1       2       3       4       dow
date                        
2018-12-18 00:00:00 2018-12-18 00:00:00 1.13498 1.13497 1.13508 1.13494 1
2018-12-18 00:01:00 2018-12-18 00:01:00 1.13497 1.13500 1.13500 1.13496 1
2018-12-18 00:02:00 2018-12-18 00:02:00 1.13500 1.13498 1.13502 1.13495 1
2018-12-18 00:03:00 2018-12-18 00:03:00 1.13498 1.13513 1.13513 1.13498 1
2018-12-18 00:04:00 2018-12-18 00:04:00 1.13513 1.13511 1.13515 1.13511 1
... ... ... ... ... ... ...
2020-11-01 23:55:00 2020-11-01 23:55:00 1.16402 1.16408 1.16410 1.16401 6
2020-11-01 23:56:00 2020-11-01 23:56:00 1.16409 1.16408 1.16410 1.16405 6
2020-11-01 23:57:00 2020-11-01 23:57:00 1.16409 1.16417 1.16418 1.16408 6
2020-11-01 23:58:00 2020-11-01 23:58:00 1.16417 1.16416 1.16418 1.16414 6
2020-11-01 23:59:00 2020-11-01 23:59:00 1.16418 1.16419 1.16419 1.16413 6

Now I want to add the following custom function:

def downu(dtime):
  d = dtime.dt.day
  x = np.ceil(d/7)
  return x

and call it before displaying the dataframe like this:

df['downum'] = df.apply(lambda x: downu(x['date']))

to add a column indicating whether the row falls on the first ('1'), second ('2'), ... or fifth ('5') occurrence of that weekday in the month.

However, this produces the following error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

/usr/local/lib/python3.6/dist-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375 

11 frames
ParserError: Unknown string format: date

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

ValueError: could not convert string to Timestamp

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    603                 key = self._maybe_cast_for_get_loc(key)
    604             except ValueError as err:
--> 605                 raise KeyError(key) from err
    606 
    607         elif isinstance(key, timedelta):

KeyError: 'date'

I have seen the following suggested in similar situations:

df['downum'] = df.apply(lambda x: downu(x.date))

but this produces the following (understandable) error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-33-7f9aa69c7ea7> in <module>()
     12 df.index = df['date']
     13 df['dow'] = df['date'].dt.dayofweek
---> 14 df['downum'] = df.apply(lambda x: downu(x.date))
     15 df

5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
   5139             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5140                 return self[name]
-> 5141             return object.__getattribute__(self, name)
   5142 
   5143     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'date'

Is there a solution?

Recommended answer

Try:

df['downum'] = df['date'].apply(downu)

and change downu to:

def downu(dtime):
  d = dtime.day      # use dtime.day rather than dtime.dt.day
  x = np.ceil(d/7)
  return int(x)      # cast to int since d/7 is float even after np.ceil()
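As a quick sanity check of the formula, the sketch below applies this version of downu to a few hand-picked dates. The dates are the Sundays of November 2020, chosen purely for illustration; they are not read from the question's CSV. Days 1-7 map to 1, days 8-14 to 2, and so on, so the result is the occurrence number of that weekday within the month.

```python
import numpy as np
import pandas as pd

def downu(dtime):
    # dtime is a single Timestamp, so .day is read directly (no .dt accessor)
    return int(np.ceil(dtime.day / 7))

# Illustrative dates only: the five Sundays of November 2020
sundays = pd.to_datetime(
    ["2020-11-01", "2020-11-08", "2020-11-15", "2020-11-22", "2020-11-29"]
)
occurrence = [downu(ts) for ts in sundays]
print(occurrence)  # [1, 2, 3, 4, 5]
```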

df.apply() works on the whole DataFrame and, by default (axis=0), passes each column in turn to the function as a Series. The index of each such column Series is still the DataFrame index (here, the timestamps), so the column label 'date' cannot be used as a key into it; that is why x['date'] raises KeyError: 'date'. You have to use apply() on the 'date' Series instead, unless your downu() function can accept the values of every column and ignore the irrelevant ones.
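A small sketch of this behaviour, using a made-up two-row frame rather than the question's data (the helper name inspect and the variables col_info and lookup_failed are only for illustration). It records what the default df.apply() hands to the function, then shows that looking up the label 'date' in a column Series fails.

```python
import pandas as pd

# Stand-in frame; the values are invented for illustration
dates = pd.to_datetime(["2020-11-01", "2020-11-08"])
df = pd.DataFrame({"date": dates, "1": [1.164, 1.165]})
df.index = df["date"]

col_info = {}

def inspect(col):
    # With the default axis=0, `col` is one whole column as a Series;
    # record whether its index equals the DataFrame index
    col_info[col.name] = list(col.index) == list(df.index)
    return 0

df.apply(inspect)
print(col_info)

# The column's index holds timestamps, not column labels,
# so the key 'date' cannot be found in it
lookup_failed = False
try:
    df["1"]["date"]
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)
```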

Here are some further solutions, some of which I think are what OP was originally trying to write. I will also discuss their pros and cons with respect to performance (execution time) on large datasets and to program readability (clarity).

Alternate solution 1:

%%timeit
df['downum'] = df.apply(lambda x: downu(x['date']), axis=1)

988 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 2:

%%timeit
df['downum'] = df.apply(lambda x: downu(x.date), axis=1)

1.01 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solution 3:

%%timeit
df['downum'] = list(map(downu, df['date']))

244 µs ± 3.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

%%timeit 
df['downum'] = df['date'].apply(downu)

810 µs ± 484 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Alternate solutions 1 and 2 are what I believe OP was originally trying to write. The only difference is that axis=1 is added to make them work. With axis=1 passed to DataFrame.apply(), downu(x['date']) and downu(x.date) can both be used inside the lambda. In effect, axis=1 makes DataFrame.apply() call the function once per row instead of once per column, passing each row as a Series. The index of that row Series consists of the original DataFrame column labels, so you can access its elements by label as with any other Series, e.g. series_obj['index'].
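The row-wise behaviour can be seen in a minimal sketch like the following. The three-row frame is invented for illustration, and df.iloc[0] is used here only as a stand-in for the kind of row Series that apply(axis=1) passes to the function.

```python
import numpy as np
import pandas as pd

def downu(dtime):
    return int(np.ceil(dtime.day / 7))

# Made-up data for illustration
dates = pd.to_datetime(["2020-11-01", "2020-11-08", "2020-11-30"])
df = pd.DataFrame({"date": dates, "1": [1.164, 1.165, 1.166]})

# A single row as a Series: its index is the column labels
first_row = df.iloc[0]
print(list(first_row.index))  # ['date', '1']

# Both access styles work once the function receives rows
by_key = df.apply(lambda row: downu(row["date"]), axis=1)
by_attr = df.apply(lambda row: downu(row.date), axis=1)
print(by_key.tolist(), by_attr.tolist())
```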

Comparing execution times, the original solution (using pandas.Series.apply()) is still a little faster than the two alternate solutions that use pandas.DataFrame.apply(..., axis=1), although its measurement above shows a much larger spread (± 484 µs). In terms of readability, the original solution, which works directly on the df['date'] Series, is also simpler and clearer.

In terms of performance, alternate solution 3 using list(map(...)) is roughly 3x to 4x faster than all the other solutions. Note that this DataFrame.apply(..., axis=1) vs list(map(...)) result is general rather than specific to this question. For a more in-depth discussion of the topic, see the answers to the post "How to apply a function to two columns of Pandas dataframe". Several answers to that post are also very useful for better understanding the apply() function.
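Outside a notebook, a comparison along these lines could be reproduced with the standard timeit module. This is only a sketch: the frame is synthetic (2000 minute-level timestamps plus a dummy price column), the candidates dictionary is an illustrative construct, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def downu(dtime):
    return int(np.ceil(dtime.day / 7))

# Synthetic stand-in for the question's data
df = pd.DataFrame({
    "date": pd.date_range("2018-12-18", periods=2000, freq="min"),
    "1": 1.135,
})

candidates = {
    "DataFrame.apply(axis=1)": lambda: df.apply(lambda x: downu(x["date"]), axis=1),
    "Series.apply": lambda: df["date"].apply(downu),
    "list(map(...))": lambda: pd.Series(list(map(downu, df["date"])), index=df.index),
}

# All approaches should produce the same column
results = {name: fn().tolist() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```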

In summary, if the dataset is not large and performance is not a major consideration, use the original solution with pandas.Series.apply() for readability and clarity. Otherwise, where performance matters, list(map(...)) is far superior to the pandas.DataFrame.apply(..., axis=1) approach.
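One further option, which is not among the solutions timed above: the question's first version of downu was written against the .dt accessor, which operates on a whole datetime Series at once. Assuming the 'date' column is already datetime64, a function of that shape can be called directly on df['date'] with no apply() at all. The name downu_vectorized and the sample dates below are illustrative only.

```python
import numpy as np
import pandas as pd

def downu_vectorized(dates):
    # `dates` is a whole datetime64 Series; .dt.day gives the day of month per row
    return np.ceil(dates.dt.day / 7).astype(int)

# Made-up dates for illustration
df = pd.DataFrame({"date": pd.to_datetime(["2020-11-01", "2020-11-08", "2020-11-30"])})
df["downum"] = downu_vectorized(df["date"])
print(df["downum"].tolist())  # [1, 2, 5]
```

Because the arithmetic runs on the whole column rather than once per element, this form is usually worth trying before any of the apply() or map() variants on large frames.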
