Pandas - Returning single dates for a date range and match weekday binary values


Question

The dataset below is meant to replicate a timetable dataset for a travel company (e.g. routes by train, bus, or plane).

import pandas as pd

df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
                   'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'], 
                   'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'], 
                   'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
                   'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'], 
                   'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})
print(df)

operator - operating company e.g. ABC Airlines, DEF Train Company

from - point of departure e.g. London, New York, Narnia

to - destination e.g. Paris

valid_from - start of the date range (can be any day of the week) in which the operator's route is available for purchase e.g. 2019-11-01

valid_to - end of the date range (can be any day of the week) in which the operator's route is available for purchase e.g. 2019-11-12

day_of_week - binary string representing availability from Sun to Sat e.g. 0101010 means the route is available on Mon, Wed, and Fri within the date range
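To make the day_of_week encoding concrete, here is a minimal sketch that decodes a mask into day names. The helper name decode_mask and the DAY_NAMES list are illustrative assumptions, not part of the original dataset or question.

```python
# Sunday-first day names, matching the character positions in day_of_week
DAY_NAMES = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

def decode_mask(mask):
    """Return the day names whose position in the mask holds '1' (hypothetical helper)."""
    return [name for name, bit in zip(DAY_NAMES, mask) if bit == '1']

print(decode_mask('0101010'))  # ['Mon', 'Wed', 'Fri']
```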

The desired output is a dataset that converts each date range into individual dates, with availability derived from the day_of_week field. The main goal is to get a clean dataset that can be loaded into Tableau to build a report that easily shows route availability.

dfout = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_a', 'op_a', 'op_a'],
                      'from': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
                      'to': ['b', 'b', 'b', 'b', 'b', 'b', 'b'],
                      'date': ['13/11/2018', '14/11/2018', '15/11/2018', '16/11/2018', '17/11/2018', '18/11/2018', '19/11/2018'],
                      'available': [1, 1, 1, 1, 0, 1, 1]})
print(dfout)

So this would be the output for op_a for the route a to b for the date range 2018-11-13 to 2018-11-19.

The dataset is odd. Date ranges can be fairly arbitrary, but day_of_week always shows availability for the days of the week within that date range. Rows with the same date range may even have different day_of_week binary combinations. The rule is that if any row's day_of_week indicates availability on a date for a given operator and route, the route is taken to be available on that date.
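The rule can be illustrated for a single date with a small sketch. The function name available_on is a hypothetical helper, and the masks are the four op_a rows for route a to b from the dataset above; all four ranges cover the two dates checked.

```python
import pandas as pd

# day_of_week masks of the four op_a rows for route a -> b
masks = ['0101010', '0100010', '0111100', '1101100']

def available_on(date, masks):
    """Return 1 if any mask marks the date's weekday as available, else 0."""
    # pandas counts Monday as 0; the mask counts Sunday as 0
    pos = (date.weekday() + 1) % 7
    return int(any(mask[pos] == '1' for mask in masks))

print(available_on(pd.Timestamp('2018-11-18'), masks))  # a Sunday
print(available_on(pd.Timestamp('2018-11-17'), masks))  # a Saturday
```

Only the last mask has a 1 in the Sunday position, which is enough to mark the Sunday as available; no mask has a 1 in the Saturday position.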

Using help from the following: Pandas: decompress date range to individual dates

import pandas as pd

df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
                   'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'], 
                   'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'], 
                   'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
                   'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'], 
                   'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})

df.set_index(['operator', 'from','to'], inplace=True)

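# dates are day-first (dd/mm/yyyy), so state the format explicitly
df['valid_from'] = pd.to_datetime(df['valid_from'], format='%d/%m/%Y')
df['valid_to'] = pd.to_datetime(df['valid_to'], format='%d/%m/%Y')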

df['row'] = range(len(df))
starts = df[['valid_from', 'day_of_week', 'row']].rename(columns={'valid_from': 'date'})
ends = df[['valid_to', 'day_of_week', 'row']].rename(columns={'valid_to':'date'})

df_decomp = pd.concat([starts, ends])
df_decomp = df_decomp.set_index('row', append=True)
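df_decomp = df_decomp.sort_index()  # sort_index returns a new frame, so assign it

# expand each (start, end) pair into one row per day, forward-filling day_of_week
df_decomp = df_decomp.groupby(level=[0,1,2,3]).apply(lambda x: x.set_index('date').resample('D').ffill())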

The result looks promising. My remaining steps are to:

  1. add a weekday column that returns the weekday of the date, starting with Sunday as 0
  2. add an available column that returns the binary value in day_of_week, using weekday as the position index
  3. lastly, somehow remove duplicate operator, from, to and date rows, keeping rows where available is 1 and dropping rows where it is 0; if there are no 1's for a given operator/from/to/date, keep available as 0
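The three steps above can be sketched on a small hand-built frame. The frame name expanded and its four rows are assumptions for illustration (two op_a masks on two dates), and taking the group maximum for step 3 is one possible way to deduplicate, not necessarily the approach the answer takes.

```python
import pandas as pd

# Hypothetical already-expanded rows: one row per (route, date, mask)
expanded = pd.DataFrame({
    'operator': ['op_a'] * 4,
    'from': ['a'] * 4,
    'to': ['b'] * 4,
    'date': pd.to_datetime(['2018-11-17', '2018-11-17', '2018-11-18', '2018-11-18']),
    'day_of_week': ['0111100', '1101100', '0111100', '1101100'],
})

# Step 1: weekday with Sunday as 0 (pandas itself uses Monday as 0)
expanded['weekday'] = (expanded['date'].dt.weekday + 1) % 7

# Step 2: read the mask character at the weekday position
expanded['available'] = [int(mask[pos]) for mask, pos
                         in zip(expanded['day_of_week'], expanded['weekday'])]

# Step 3: one row per operator/from/to/date; max is 1 if any row is 1, else 0
result = expanded.groupby(['operator', 'from', 'to', 'date'],
                          as_index=False)['available'].max()
print(result)
```

Here 2018-11-17 (a Saturday) ends up with available 0 and 2018-11-18 (a Sunday) with available 1.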

Madness... Apologies for the long-windedness; I hope I'm making some sense. Any help on this would be much appreciated.

  • Updated the 'What I've tried to do' part above.
  • Updated dataset a tad to include a bit more variety in the dates (still the same dataset just adjusted valid_to dates)

Recommended answer

This does the trick:

import pandas as pd
import numpy as np

# dataset
df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
                   'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'], 
                   'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'], 
                   'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
                   'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'], 
                   'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})

# set operator, from, to as index
df.set_index(['operator', 'from','to'], inplace=True)

# convert date ranges to datetime types
df['valid_from'] = pd.to_datetime(df['valid_from'], format='%d/%m/%Y')  # dates are day-first
df['valid_to'] = pd.to_datetime(df['valid_to'], format='%d/%m/%Y')

# bring individual dates in date ranges and stack
df['row'] = range(len(df))
starts = df[['valid_from', 'day_of_week', 'row']].rename(columns={'valid_from': 'date'})
ends = df[['valid_to', 'day_of_week', 'row']].rename(columns={'valid_to':'date'})

df_decomp = pd.concat([starts, ends])
df_decomp = df_decomp.set_index('row', append=True)
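df_decomp = df_decomp.sort_index()  # sort_index returns a new frame, so assign it

# resample each (start, end) pair to daily rows, forward-filling day_of_week
df_decomp = df_decomp.groupby(level=[0,1,2,3]).apply(lambda x: x.set_index('date').resample('D').ffill())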

# remove indexes
df_decomp.reset_index(level=3, drop=True, inplace=True)
df_decomp.reset_index(inplace=True)

# create weekday column
# pandas uses Monday=0 ... Sunday=6; shift so that Sunday=0 ... Saturday=6
df_decomp['weekday'] = (df_decomp['date'].dt.weekday + 1) % 7

# use weekday to extract availability in day_of_week
df_decomp['available'] = [b[a] for a, b in zip(df_decomp['weekday'], df_decomp['day_of_week'])]
df_decomp['available'] = df_decomp['available'].astype('int')

# sort values and remove duplicates with available=1 taking priority
df_decomp = df_decomp.sort_values('available', ascending=False).drop_duplicates(['operator','from','to','date'])
df_decomp = df_decomp.sort_values(['operator','from','to','date'])

df_decomp
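As a possible alternative to the concat-and-resample approach above, the date ranges can also be expanded with pd.date_range and DataFrame.explode. This is a sketch under assumptions, not the answer's method: it uses only the four op_a rows, and the names out and the groupby/max deduplication are choices made here for illustration.

```python
import pandas as pd

# The four op_a rows for route a -> b from the dataset
df = pd.DataFrame({
    'operator': ['op_a'] * 4,
    'from': ['a'] * 4,
    'to': ['b'] * 4,
    'valid_from': ['13/11/2018'] * 4,
    'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018'],
    'day_of_week': ['0101010', '0100010', '0111100', '1101100'],
})
for col in ['valid_from', 'valid_to']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

# Build one list of daily timestamps per row, then explode to one row per date
df['date'] = [list(pd.date_range(start, end, freq='D'))
              for start, end in zip(df['valid_from'], df['valid_to'])]
out = df.explode('date')
out['date'] = pd.to_datetime(out['date'])

# Sunday-first weekday, then look up the matching mask character
out['weekday'] = (out['date'].dt.weekday + 1) % 7
out['available'] = [int(mask[pos]) for mask, pos
                    in zip(out['day_of_week'], out['weekday'])]

# A date is available if any row covering it says so
out = out.groupby(['operator', 'from', 'to', 'date'],
                  as_index=False)['available'].max()
print(out)
```

For 13/11/2018 through 19/11/2018 this reproduces the available values in the desired output from the question; 20/11/2018 is covered only by the first two rows, neither of which marks Tuesday as available.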
