How to resample until a specific date criteria is met
Problem Description
With some help from the community I have managed to get to the function below (see my previous question on building the function). I am now trying to work out how to get the resampled dates to run to the latest date that appears anywhere in either of the input data sets, for any code. Below I have included the current output I am getting and my desired output.
Input data:
Input 1 df1 - In
date code qty
0 2019-01-10 A 20
1 2019-01-10 B 12
2 2019-01-10 C 10
3 2019-01-11 A 2
4 2019-01-11 B 30
5 2019-01-11 C 2
7 2019-01-12 A 4
8 2019-01-12 B 6
11 2019-01-13 A 10
12 2019-01-13 B 12
13 2019-01-13 C 1
Input 2 df2 - Outbound
date code qty
0 2019-01-11 A 5
1 2019-01-11 B 1
2 2019-01-11 C 3
3 2019-01-12 A 100
6 2019-01-13 B 1
7 2019-01-13 C 1
8 2019-01-15 A 1
9 2019-01-16 B 1
Existing Code:
import numpy as np
from numba import njit
@njit
def poscumsum(x):
total = 0
result = np.empty(x.shape)
for i, y in enumerate(x):
total += y
if total < 0:
total = 0
result[i] = total
return result
a = df1.set_index(['code', 'date'])
b = df2.set_index(['code', 'date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
df3 = df3.groupby('code').resample('D', level='date').sum()
df3['qty'] = df3.groupby('code')['qty'].transform(
lambda g: poscumsum(g.values))
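To see what poscumsum does, here is the same clipped running sum without the @njit decorator (numba only affects speed), applied to code A's daily net movement (In minus Out) from the sample data:

```python
import numpy as np

def poscumsum(x):
    # Running sum that is clipped at zero: the stock never goes negative.
    total = 0
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result

# Code A's net movement, 2019-01-10 through 2019-01-16:
net_a = np.array([20, -3, -96, 10, 0, -1, 0])
# Clipped running total: 20, 17, 0, 10, 10, 9, 9
print(poscumsum(net_a))
```

Note how the -96 on 2019-01-12 is absorbed at zero rather than driving the total to -79, which is exactly the behavior in the A column of the output.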
Current Output:
Each code is only represented for dates on which it appears in the In or Out dfs.
code date qty
0 A 2019-01-10 20
1 A 2019-01-11 17
2 A 2019-01-12 0
3 A 2019-01-13 10
4 A 2019-01-14 10
5 A 2019-01-15 9
6 B 2019-01-10 12
7 B 2019-01-11 41
8 B 2019-01-12 47
9 B 2019-01-13 58
10 B 2019-01-14 58
11 B 2019-01-15 58
12 B 2019-01-16 57
13 C 2019-01-10 10
14 C 2019-01-11 9
15 C 2019-01-12 9
16 C 2019-01-13 9
Desired Output:
Each code is represented for each date between 2019-01-10 & 2019-01-16.
code date qty
0 A 2019-01-10 20
1 A 2019-01-11 17
2 A 2019-01-12 0
3 A 2019-01-13 10
4 A 2019-01-14 10
5 A 2019-01-15 9
6 A 2019-01-16 9
7 B 2019-01-10 12
8 B 2019-01-11 41
9 B 2019-01-12 47
10 B 2019-01-13 58
11 B 2019-01-14 58
12 B 2019-01-15 58
13 B 2019-01-16 57
14 C 2019-01-10 10
15 C 2019-01-11 9
16 C 2019-01-12 9
17 C 2019-01-13 9
18 C 2019-01-14 9
19 C 2019-01-15 9
20 C 2019-01-16 9
OK, here is a 2D version of poscumsum (generalized to cap the running sum at a minimum and/or maximum):
@njit
def cumsum_capped_2d(x, xmin=None, xmax=None):
n, m = x.shape
result = np.empty_like(x)
if n == 0:
return result
total = np.zeros_like(x[0])
for i in range(n):
total += x[i]
if xmin is not None:
total[total < xmin] = xmin
if xmax is not None:
total[total > xmax] = xmax
result[i] = total
return result
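As a quick sanity check, the same function works as plain Python if the @njit decorator is dropped (numba only affects speed). Here it is exercised on a tiny two-column array, built from the sample data for codes A and B, where the first column would dip below zero without the cap:

```python
import numpy as np

def cumsum_capped_2d(x, xmin=None, xmax=None):
    # Identical logic to the @njit version above; numba only changes speed.
    n, m = x.shape
    result = np.empty_like(x)
    if n == 0:
        return result
    total = np.zeros_like(x[0])
    for i in range(n):
        total += x[i]
        if xmin is not None:
            total[total < xmin] = xmin
        if xmax is not None:
            total[total > xmax] = xmax
        result[i] = total
    return result

# Daily net movements for two codes; column 0 goes 20, -3, -96:
x = np.array([[20.0, 12.0], [-3.0, 29.0], [-96.0, 6.0]])
out = cumsum_capped_2d(x, xmin=0)
# Column 0 is clipped at zero on the last row instead of going to -79.
print(out)
```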
And here is how to use it (now that you want all dates spanning the same period); the good news is that there is no more groupby, so it is faster than ever:
a = df1.pivot(index='date', columns='code', values='qty')
b = df2.pivot(index='date', columns='code', values='qty')
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0)).resample('D').sum()
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)
Or, in two (convoluted) lines:
df3 = (df1.set_index(['date', 'code']).subtract(df2.set_index(['date', 'code']), fill_value=0)
.unstack('code', fill_value=0).resample('D').sum())
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)
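For reference, the sample frames from the question can be rebuilt like this to try either variant (column layout taken from the tables above):

```python
import pandas as pd

# Input 1 (In) and Input 2 (Outbound), as shown in the question.
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-10'] * 3 + ['2019-01-11'] * 3
                           + ['2019-01-12'] * 2 + ['2019-01-13'] * 3),
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
    'qty': [20, 12, 10, 2, 30, 2, 4, 6, 10, 12, 1],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-11'] * 3 + ['2019-01-12']
                           + ['2019-01-13'] * 2 + ['2019-01-15', '2019-01-16']),
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'qty': [5, 1, 3, 100, 1, 1, 1, 1],
})
```

With these inputs, the resampled index runs daily from 2019-01-10 through 2019-01-16, since 2019-01-16 is the latest date in either frame.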
On your data:
>>> df3
code A B C
date
2019-01-10 20.0 12.0 10.0
2019-01-11 17.0 41.0 9.0
2019-01-12 0.0 41.0 9.0
2019-01-13 0.0 52.0 9.0
2019-01-14 0.0 52.0 9.0
2019-01-15 0.0 52.0 9.0
2019-01-16 0.0 51.0 9.0
Of course, you are free to stack back into a skinny df, re-order, drop the index, etc. For example, to match your desired output:
>>> df3.stack().swaplevel(0,1).sort_index().reset_index()
code date qty
0 A 2019-01-10 20.0
1 A 2019-01-11 17.0
2 A 2019-01-12 0.0
3 A 2019-01-13 10.0
4 A 2019-01-14 10.0
5 A 2019-01-15 9.0
6 A 2019-01-16 9.0
7 B 2019-01-10 12.0
8 B 2019-01-11 41.0
9 B 2019-01-12 47.0
10 B 2019-01-13 58.0
11 B 2019-01-14 58.0
12 B 2019-01-15 58.0
13 B 2019-01-16 57.0
14 C 2019-01-10 10.0
15 C 2019-01-11 9.0
16 C 2019-01-12 9.0
17 C 2019-01-13 9.0
18 C 2019-01-14 9.0
19 C 2019-01-15 9.0
20 C 2019-01-16 9.0