Pandas - Returning single dates for a date range and matching weekday binary values
Problem description
The dataset below is meant to replicate a timetable dataset for a travel company (e.g. routes by train, bus, or plane).
import pandas as pd
df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'],
'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'],
'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'],
'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})
print(df)
operator - operating company, e.g. ABC Airlines, DEF Train Company
from - departure point, e.g. London, New York, Narnia
to - destination, e.g. Paris
valid_from - start of the date range (can be any day of the week) in which the route is available for purchase for the operator, e.g. 2019-11-01
valid_to - end of the date range (can be any day of the week) in which the route is available for purchase for the operator, e.g. 2019-11-12
day_of_week - binary string representing availability for Sun to Sat, e.g. 0101010 means the route is available on Mon, Wed, and Fri in the date range
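To make the day_of_week encoding concrete, here is a minimal sketch that lists the days switched on in a mask. The helper name decode_day_of_week and the DAY_NAMES list are hypothetical and not part of the original post; the sketch only assumes the Sun-to-Sat ordering described above.

```python
# Hypothetical helper, not from the original post: list the day names
# switched on in a Sun-to-Sat mask such as '0101010'.
DAY_NAMES = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']


def decode_day_of_week(mask):
    """Return the names of the days flagged '1' in a 7-character mask."""
    if len(mask) != 7 or set(mask) - {'0', '1'}:
        raise ValueError(f'expected 7 characters of 0/1, got {mask!r}')
    return [name for name, flag in zip(DAY_NAMES, mask) if flag == '1']


print(decode_day_of_week('0101010'))  # ['Mon', 'Wed', 'Fri']
```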
The desired result is an output dataset that converts each date range into individual dates, with availability derived from the day_of_week field. The main goal is to get a clean dataset that can then be loaded into Tableau to build a report that easily shows route availability.
dfout = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_a', 'op_a', 'op_a'], 'from': ['a', 'a', 'a', 'a', 'a', 'a', 'a'], 'to': ['b', 'b', 'b', 'b', 'b', 'b', 'b'], 'date': ['13/11/2018', '14/11/2018', '15/11/2018', '16/11/2018', '17/11/2018', '18/11/2018', '19/11/2018'], 'available': [1, 1, 1, 1, 0, 1, 1]})
print(dfout)
So this would be the output for op_a for the route a to b for the date range 2018-11-13 to 2018-11-19.
The dataset is quirky. Date ranges can be quite random, but day_of_week will always show availability for the days of the week in that date range. Some identical date ranges may even have different day_of_week binary combinations, but essentially, if any day_of_week value indicates availability for a given date range, route and operator, then the route is taken to be available on that date.
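The "available if any row says so" rule amounts to a position-by-position OR of the masks. The sketch below illustrates it on the four op_a masks from the sample data; combine_masks is a hypothetical helper, and it deliberately ignores the fact that those rows have slightly different valid_to dates, so it illustrates the rule itself and does not replace a per-date expansion.

```python
from functools import reduce


def combine_masks(masks):
    """OR together 7-character 0/1 strings (Sun to Sat), position by position."""
    combined = reduce(lambda acc, m: acc | int(m, 2), masks, 0)
    return format(combined, '07b')


# The four op_a masks from the sample data (route a to b).
print(combine_masks(['0101010', '0100010', '0111100', '1101100']))  # 1111110
```

The result 1111110 says op_a runs every day except Saturday, which agrees with the single 0 on 17/11/2018 in the desired output.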
What I've tried so far, using help from this question: Pandas: decompress date range to individual dates
import pandas as pd
df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'],
'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'],
'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'],
'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})
df.set_index(['operator', 'from','to'], inplace=True)
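# the dates are day-first (dd/mm/yyyy), so state the format explicitly
df['valid_from'] = pd.to_datetime(df['valid_from'], format='%d/%m/%Y')
df['valid_to'] = pd.to_datetime(df['valid_to'], format='%d/%m/%Y')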
df['row'] = range(len(df))
starts = df[['valid_from', 'day_of_week', 'row']].rename(columns={'valid_from': 'date'})
ends = df[['valid_to', 'day_of_week', 'row']].rename(columns={'valid_to':'date'})
df_decomp = pd.concat([starts, ends])
df_decomp = df_decomp.set_index('row', append=True)
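# sort_index() returns a new frame, so assign the result back
df_decomp = df_decomp.sort_index()
# resample each (operator, from, to, row) group to daily frequency and forward-fill
df_decomp = df_decomp.groupby(level=[0,1,2,3]).apply(lambda x: x.set_index('date').resample('D').ffill())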
The result looks promising. My final thoughts are to:
- add a weekday column that returns the weekday of the date, starting with Sunday as 0
- add an available column that returns the binary value in day_of_week, using weekday as the position index
- lastly, somehow remove duplicate operator, from and to rows, keeping the rows whose available is 1 and dropping those that are 0; if there are no 1s for a given operator/from/to, keep available as 0
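The first two steps above can be sketched as follows. This is a minimal illustration on a made-up single-mask frame called sample (an assumption for the example, not the asker's data); it relies on pandas numbering Monday as 0 and Sunday as 6, so adding 1 and taking the result modulo 7 moves Sunday to position 0.

```python
import pandas as pd

# Small illustrative frame; the dates are the op_a range from the question.
sample = pd.DataFrame({
    'date': pd.date_range('2018-11-13', '2018-11-19', freq='D'),
    'day_of_week': '0101010',
})

# pandas numbers Monday as 0 and Sunday as 6; shift so that Sunday is 0.
sample['weekday'] = (sample['date'].dt.weekday + 1) % 7

# Use weekday as the position index into the day_of_week string.
sample['available'] = [
    int(mask[pos]) for mask, pos in zip(sample['day_of_week'], sample['weekday'])
]

print(sample)
```

With the mask 0101010, only the Wednesday, Friday and Monday in that range come out as available.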
Madness... Apologies for the long-windedness; I hope I'm making some sense. Any help on this would be much appreciated.
- Updated the 'What I've tried to do' part above.
- Updated the dataset a tad to include a bit more variety in the dates (still the same dataset, just with adjusted valid_to dates).
Recommended answer
This does the trick:
import pandas as pd
import numpy as np
# dataset
df = pd.DataFrame({'operator': ['op_a', 'op_a', 'op_a', 'op_a', 'op_b', 'op_b', 'op_b', 'op_b', 'op_c', 'op_c', 'op_c', 'op_c', 'op_d', 'op_d'],
'from': ['a', 'a', 'a', 'a', 'c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'x', 'x'],
'to': ['b', 'b', 'b', 'b', 'd', 'd', 'd', 'd', 'b', 'b', 'b', 'b', 'y', 'y'],
'valid_from': ['13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '13/11/2018', '15/02/2019', '15/02/2019', '15/02/2019', '15/02/2019', '20/05/2019', '21/05/2019'],
'valid_to': ['20/11/2018', '20/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '19/11/2018', '21/11/2018', '21/11/2018', '21/02/2019', '21/02/2019', '20/02/2019', '20/02/2019', '30/05/2019', '29/05/2019'],
'day_of_week': ['0101010', '0100010', '0111100', '1101100', '0101010', '0100010', '0111100', '1101100', '0001101', '1110000', '0000000', '0000001', '1000000', '1000001']})
# set operator, from, to as index
df.set_index(['operator', 'from','to'], inplace=True)
# convert date ranges to datetime types
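# (dates are day-first, dd/mm/yyyy, so the format is stated explicitly)
df['valid_from'] = pd.to_datetime(df['valid_from'], format='%d/%m/%Y')
df['valid_to'] = pd.to_datetime(df['valid_to'], format='%d/%m/%Y')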
# bring individual dates in date ranges and stack
df['row'] = range(len(df))
starts = df[['valid_from', 'day_of_week', 'row']].rename(columns={'valid_from': 'date'})
ends = df[['valid_to', 'day_of_week', 'row']].rename(columns={'valid_to':'date'})
df_decomp = pd.concat([starts, ends])
df_decomp = df_decomp.set_index('row', append=True)
df_decomp = df_decomp.sort_index()
# resample each (operator, from, to, row) group to daily frequency and forward-fill
df_decomp = df_decomp.groupby(level=[0,1,2,3]).apply(lambda x: x.set_index('date').resample('D').ffill())
# remove indexes
df_decomp.reset_index(level=3, drop=True, inplace=True)
df_decomp.reset_index(inplace=True)
# create weekday column
df_decomp['weekday'] = np.where(df_decomp['date'].dt.weekday == 6,
df_decomp['date'].dt.weekday - 6,
df_decomp['date'].dt.weekday + 1)
# use weekday to extract availability in day_of_week
df_decomp['available'] = [b[a] for a, b in zip(df_decomp['weekday'], df_decomp['day_of_week'])]
df_decomp['available'] = df_decomp['available'].astype('int')
# sort values and remove duplicates with available=1 taking priority
df_decomp = df_decomp.sort_values('available', ascending=False).drop_duplicates(['operator','from','to','date'])
df_decomp = df_decomp.sort_values(['operator','from','to','date'])
print(df_decomp)
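As a possible alternative to the resample-based expansion above, the same idea can be sketched with pd.date_range, DataFrame.explode and a groupby maximum. This is not the answer's method, only an illustration under the assumption of a pandas version that provides explode (0.25 or later); the frame small holds just two of the op_a rows, and the names small, daily and result are made up for the example.

```python
import pandas as pd

# Two op_a rows from the sample data, enough to show overlapping ranges.
small = pd.DataFrame({
    'operator': ['op_a', 'op_a'],
    'from': ['a', 'a'],
    'to': ['b', 'b'],
    'valid_from': ['13/11/2018', '13/11/2018'],
    'valid_to': ['20/11/2018', '19/11/2018'],
    'day_of_week': ['0101010', '0111100'],
})
for col in ['valid_from', 'valid_to']:
    small[col] = pd.to_datetime(small[col], format='%d/%m/%Y')

# One list of daily timestamps per row, then one row per date.
small['date'] = [list(pd.date_range(start, end, freq='D'))
                 for start, end in zip(small['valid_from'], small['valid_to'])]
daily = small.explode('date').reset_index(drop=True)
daily['date'] = pd.to_datetime(daily['date'])

# Sunday-first position, then look up the flag in day_of_week.
daily['weekday'] = (daily['date'].dt.weekday + 1) % 7
daily['available'] = [int(m[p]) for m, p in zip(daily['day_of_week'], daily['weekday'])]

# max() keeps 1 if any overlapping row marks the date as available.
result = (daily.groupby(['operator', 'from', 'to', 'date'], as_index=False)['available']
               .max())
print(result)
```

Taking the maximum per operator, from, to and date plays the same role as sorting by available and dropping duplicates: a date ends up as 1 if any overlapping row allows it, and 0 otherwise.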