平均库存历史记录表 [英] Average stock history table

查看:141
本文介绍了平均库存历史记录表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一张桌子,它跟踪一些商店和产品随时间变化的库存变化。该值是绝对库存,但仅在库存发生变化时才插入新行。这种设计是为了使桌子保持较小,因为它有望迅速增长。

I have a table that tracks changes in stocks through time for some stores and products. The value is the absolute stock, but we only insert a new row when a change in stock occurs. This design was to keep the table small, because it is expected to grow rapidly.

这是示例架构和一些测试数据:

This is an example schema and some test data:

CREATE TABLE stocks (
  id serial NOT NULL,
  store_id integer NOT NULL,
  product_id integer NOT NULL,
  date date NOT NULL,
  value integer NOT NULL,
  CONSTRAINT stocks_pkey PRIMARY KEY (id),
  CONSTRAINT stocks_store_id_product_id_date_key 
    UNIQUE (store_id, product_id, date)
);

insert into stocks(store_id, product_id, date, value) values
(1,10,'2013-01-05', 4),
(1,10,'2013-01-09', 7),
(1,10,'2013-01-11', 5),
(1,11,'2013-01-05', 8),
(2,10,'2013-01-04', 12),
(2,11,'2012-12-04', 23);

我需要能够确定开始和结束日期之间每种产品和商店的平均库存,但我的问题是,简单的avg()并未考虑到更改之间的库存保持不变。

I need to be able to determine the average stock between a start and end date, per product and store, but my problem is that a simple avg() doesn't take into account that the stock remains the same between changes.

我想要的是这样的东西:

What I would like is something like this:

select s.store_id,  s.product_id , special_avg(s.value) 
from stocks s where s.date between '2013-01-01' and '2013-01-15'
group by s.store_id,  s.product_id

结果是这样的:

store_id  product_id  avg
1         10          3.6666666667
1         11          5.8666666667
2         10          9.6
2         11          23

为了使用SQL平均函数,我需要及时传播 store_id和product_id的先前值,直到发生新的更改。关于如何实现此目标的任何想法?

In order to use the SQL average function I would need to "propagate" forward in time the previous value for a store_id and product_id, until a new change occurs. Any ideas as how to achieve this?

推荐答案

此任务的特殊难度:您不能仅仅选择您时间范围内的数据点,但必须考虑时间范围内的最新数据点,以及时间范围后的最早的数据点时间范围。这对于每一行都不同,并且每个数据点可能存在也可能不存在。需要复杂的查询并使其难以使用索引。

The special difficulty of this task: you cannot just pick data points inside your time range, but have to consider the latest data point before the time range and the earliest data point after the time range additionally. This varies for every row and each data point may or may not exist. Requires a sophisticated query and makes it hard to use indexes.

您可以使用 范围类型 运算符 (Postgres 9.2 + )以简化计算:

You could use range types and operators (Postgres 9.2+) to simplify calculations:

WITH input(a,b) AS (SELECT '2013-01-01'::date  -- your time frame here
                         , '2013-01-15'::date) -- inclusive borders
SELECT store_id, product_id
     , sum(upper(days) - lower(days))                    AS days_in_range
     , round(sum(value * (upper(days) - lower(days)))::numeric
                    / (SELECT b-a+1 FROM input), 2)      AS your_result
     , round(sum(value * (upper(days) - lower(days)))::numeric
                    / sum(upper(days) - lower(days)), 2) AS my_result
FROM (
   SELECT store_id, product_id, value, s.day_range * x.day_range AS days
   FROM  (
      SELECT store_id, product_id, value
           , daterange (day, lead(day, 1, now()::date)
             OVER (PARTITION BY store_id, product_id ORDER BY day)) AS day_range 
      FROM   stock
      ) s
   JOIN  (
      SELECT daterange(a, b+1) AS day_range
      FROM   input
      ) x ON s.day_range && x.day_range
   ) sub
GROUP  BY 1,2
ORDER  BY 1,2;

注意,我使用列名 day 而不是日期。我从不使用基本类型名称作为列名称。

Note, I use the column name day instead of date. I never use basic type names as column names.

在子查询 sub 中,我从下一行获取日期对于具有窗口功能 lead()的每个项目,使用内置选项在没有下一行的情况下默认提供今天。

以此形成一个 daterange 并将其与重叠运算符&& 的输入进行匹配,使用交集运算符 * 计算得出的日期范围。

In the subquery sub I fetch the day from the next row for each item with the window function lead(), using the built-in option to provide "today" as default where there is no next row.
With this I form a daterange and match it against the input with the overlap operator &&, computing the resulting date range with the intersection operator *.

此处的所有范围均带有专有上边界。这就是为什么我要在输入范围内增加一天的原因。这样,我们可以简单地从上限(范围)中减去下限(范围)以获得天数。

All ranges here are with exclusive upper border. That's why I add one day to the input range. This way we can simply subtract lower(range) from upper(range) to get the number of days.

我认为昨天是具有可靠数据的最新一天。 Today在现实生活中仍然可以更改。因此,我将今天( now():: date )用作开放范围的唯一上边界。

I assume that "yesterday" is the latest day with reliable data. "Today" can still change in a real life application. Consequently, I use "today" (now()::date) as exclusive upper border for open ranges.

我提供了两个结果:


  • 您的结果同意您显示的结果。

    您可以无条件地将日期范围除以天数。例如,如果某项仅在最后一天列出,那么您的平均值就会非常低(误导!)。

  • your_result agrees with your displayed results.
    You divide by the number of days in your date range unconditionally. For instance, if an item is only listed for the last day, you get a very low (misleading!) "average".

my_result 计算相同或更高的数字。

我除以实际列出项目的天数。例如,如果某项仅在最后一天列出,我将列出的值作为平均值返回。

my_result computes the same or higher numbers.
I divide by the actual number of days an item is listed. For instance, if an item is only listed for the last day, I return the listed value as average.

到了解差异,我添加了列出项目的天数: days_in_range

To make sense of the difference I added the number of days the item was listed: days_in_range

SQL小提琴

对于此类数据,旧行通常不会发生变化。对于物化视图,这将是一个很好的例子:

For this kind of data, old rows typically don't change. This would make an excellent case for a materialized view:

CREATE MATERIALIZED VIEW mv_stock AS
SELECT store_id, product_id, value
     , daterange (day, lead(day, 1, now()::date) OVER (PARTITION BY store_id, product_id
                                                       ORDER BY day)) AS day_range
FROM   stock;

然后您可以添加支持相关运算符& 的GiST索引:

Then you can add a GiST index which supports the relevant operator &&:

CREATE INDEX mv_stock_range_idx ON mv_stock USING gist (day_range);



大测试用例



我运行了一个具有200k行的更实际的测试。使用MV进行查询的速度大约是其6倍,而后者则是@Joop查询的10倍左右。性能在很大程度上取决于数据分发。 MV对于大型表和高频率条目最有帮助。此外,如果表中的列与此查询无关,则MV可能会更小。有关成本与收益的问题。

Big test case

I ran a more realistic test with 200k rows. The query using the MV was about 6 times as fast, which in turn was ~ 10x as fast as @Joop's query. Performance heavily depends on data distribution. An MV helps most with big tables and high frequency of entries. Also, if the table has columns that are not relevant to this query, a MV can be smaller. A question of cost vs. gain.

我把到目前为止(并改编过)发布的所有解决方案都摆在了摆弄着:

I've put all solutions posted so far (and adapted) in a big fiddle to play with:

具有大型测试用例的SQL Fiddle。

只有40k行的SQL提琴-以避免sqlfiddle.com上的超时

SQL Fiddle with big test case.
SQL Fiddle with only 40k rows - to avoid timeout on sqlfiddle.com

这篇关于平均库存历史记录表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆