递归SQL语句(PostgreSQL 9.1.4) [英] Recursive SQL statement (PostgreSQL 9.1.4)
问题描述
PostgreSQL 9.1
PostgreSQL 9.1
业务状况
每个月都有一个分配给特定流程的新一批帐户。每批都可以按月份,帐户数和帐户总余额来描述。该过程的目标是从客户那里收回一些余额。
然后每月分别跟踪每个批次(自从批次转移到流程以来,每个月回收的金额)。
Every month, there is a new batch of accounts given to a specific process. Every batch can be described by month, number of accounts and total balance of accounts. The goal of the process is to recover some of the balance back from customers. Each batch is than tracked separately on a monthly basis (amount recovered on each month since batch was transferred to the process).
目标
我的目标是预测将来将回收多少。
My goal is to predict what amount will be recovered in the future.
数据定义
create table vintage_data (
granularity date, /* Month when account entered process*/
distance_in_months integer, /* Distance in months from date when accounts entered process*/
entry_accounts integer, /* Number of accounts that entered process in a given month*/
entry_amount numeric, /* Total amount for account that entered process in a given month*/
recovery_amount numeric /* Amount recovered in Nth month on accounts that entered process in a given month */
);
样本数据
insert into vintage_data values('2012-01-31',1,200,100000,1000);
insert into vintage_data values('2012-01-31',2,200,100000,2000);
insert into vintage_data values('2012-01-31',3,200,100000,3000);
insert into vintage_data values('2012-01-31',4,200,100000,3500);
insert into vintage_data values('2012-01-31',5,200,100000,3400);
insert into vintage_data values('2012-01-31',6,200,100000,3300);
insert into vintage_data values('2012-02-28',1,250,150000,1200);
insert into vintage_data values('2012-02-28',2,250,150000,1600);
insert into vintage_data values('2012-02-28',3,250,150000,1800);
insert into vintage_data values('2012-02-28',4,250,150000,1200);
insert into vintage_data values('2012-02-28',5,250,150000,1600);
insert into vintage_data values('2012-03-31',1,200,90000,1300);
insert into vintage_data values('2012-03-31',2,200,90000,1200);
insert into vintage_data values('2012-03-31',3,200,90000,1400);
insert into vintage_data values('2012-03-31',4,200,90000,1000);
insert into vintage_data values('2012-04-30',1,300,180000,1600);
insert into vintage_data values('2012-04-30',2,300,180000,1500);
insert into vintage_data values('2012-04-30',3,300,180000,4000);
insert into vintage_data values('2012-05-31',1,400,225000,2200);
insert into vintage_data values('2012-05-31',2,400,225000,6000);
insert into vintage_data values('2012-06-30',1,100,60000,1000);
计算过程
您可以将数据想象成一个三角矩阵(X值将被预测):
You can imagine the data as a triangular matrix (X values are to be forecasted):
distance_in_months 1 2 3 4 5 6
granularity entry_accounts entry_amount
2012-01-31 200 100000 1000 2000 3000 3500 3400 3300
2012-02-28 250 150000 1200 1600 1800 1200 1600 (X-1)
2012-03-31 200 90000 1300 1200 1400 1000 (X0) (X4)
2012-04-30 300 180000 1600 1500 4000 (X1) (X5) (X8)
2012-05-31 400 225000 2200 6000 (X2) (X6) (X9) (X11)
2012-06-30 100 60000 1000 (X3) (X7) (X10) (X12 (X13)
算法
我的目标是预测所有遗漏的点(未来)。过程,这是对点X1的计算
The goal I have is to forecast all the missing points (future). To illustrate the process, this is the calculation for the point X1
1)使用距离最大为4来获取前三个月的行总计:
1) Get row totals for previous three months using distance up to 4:
2012-01-31 1000+2000+3000+3500=9500 (d4m3)
2012-02-28 1200+1600+1800+1200=5800 (d4m2)
2012-03-31 1300+1200+1400+1000=4900 (d4m1)
2)使用距离最大为3的距离获取前三个月的行总计:
2) Get row totals for previous three months using distance up to 3:
2012-01-31 1000+2000+3000=6000 (d3m3)
2012-02-28 1200+1600+1800=4600 (d3m2)
2012-03-31 1300+1200+1400=3800 (d3m1)
3)计算距离3和距离4的加权平均运行速度(由entry_amount加权):
3) Calculate weighted average running rate for distance 3 and distance 4 (weighted by entry_amount):
(d4m3+d4m2+d4m1)/(100000+150000+90000) = (9500+5800+4900)/(100000+150000+90000) = 20200/340000 = 0.0594
(d3m3+d3m2+d3m1)/(100000+150000+90000) = (6000+4600+3800)/(100000+150000+90000) = 14400/340000 = 0.0424
4)计算变更距离3和距离4
4) Calculate the change between distance 3 and distance 4
((d4m3+d4m2+d4m1)/(100000+150000+90000))/((d3m3+d3m2+d3m1)/(100000+150000+90000)) =
= (20200/340000)/(14400/340000) =
= 0.0594/0.0424 = 1.403 (PredictionRateForX1)
5)使用不超过3的距离计算预测月份的行总数。
5) Calculate row totals for predicted month using distance up to 3:
2012-04-30 1600+1500+4000=7100
6)使用entry_amount预测月份的费率
6) Calculate rate using entry_amount for predicted month
7100/180000 = 0.0394
7)计算X1的预测速率
7) Calculate rate predicted for X1
0.0394 * PredictionRateForX1 = 0.05534
8)计算X1的金额
(0.05534-0.0394)*180000 = 2869.2
问题
问题是如何使用SQL语句计算矩阵的其余部分(从x-1到x13)。显然,这将需要某种递归算法。
The problem is how to calculate the rest of the matrix (from x-1 to x13) using SQL statement. It is obvious that this will require some sort of recursive algorithm.
推荐答案
这是一项艰巨的任务,将其拆分以使其更复杂。可管理的。我会把它放在带有 RETURN TABLE
的plpgsql函数中:
It's a big task, split it up to make it more manageable. I would put that in a plpgsql function with RETURN TABLE
:
-
使用交叉表查询
为计算过程矩阵创建临时表。您需要tablefunc
模块。运行(每个数据库一次):
Create a temporary table for your "Calculation Process" matrix using a crosstab query You need the
tablefunc
module installed for that. Run (once per database):
CREATE EXTENSION tablefunc;
按字段更新临时表。
Update the temp table field by field.
以下演示功能完整,并已在PostgreSQL 9.1.4中进行了测试。在问题中提供的表定义的基础上:
The following demo is fully functional and tested with PostgreSQL 9.1.4. Building on the table definition provided in the question:
-- DROP FUNCTION f_forcast();
CREATE OR REPLACE FUNCTION f_forcast()
RETURNS TABLE (
granularity date
,entry_accounts numeric
,entry_amount numeric
,d1 numeric
,d2 numeric
,d3 numeric
,d4 numeric
,d5 numeric
,d6 numeric) AS
$BODY$
BEGIN
--== Create temp table with result of crosstab() ==--
CREATE TEMP TABLE matrix ON COMMIT DROP AS
SELECT *
FROM crosstab (
'SELECT granularity, entry_accounts, entry_amount
,distance_in_months, recovery_amount
FROM vintage_data
ORDER BY 1, 2',
'SELECT DISTINCT distance_in_months
FROM vintage_data
ORDER BY 1')
AS tbl (
granularity date
,entry_accounts numeric
,entry_amount numeric
,d1 numeric
,d2 numeric
,d3 numeric
,d4 numeric
,d5 numeric
,d6 numeric
);
ANALYZE matrix; -- update statistics to help calculations
--== Calculations ==--
-- I implemented the first calculation for X1 and leave the rest to you.
-- Can probably be generalized in a loop or even a single statement.
UPDATE matrix m
SET d4 = (
SELECT (sum(x.d1) + sum(x.d2) + sum(x.d3) + sum(x.d4))
/(sum(x.d1) + sum(x.d2) + sum(x.d3)) - 1
-- removed redundant sum(entry_amount) from equation
FROM (
SELECT *
FROM matrix a
WHERE a.granularity < m.granularity
ORDER BY a.granularity DESC
LIMIT 3
) x
) * (m.d1 + m.d2 + m.d3)
WHERE m.granularity = '2012-04-30';
--- Next update X2 ..
--== Return results ==--
RETURN QUERY
TABLE matrix
ORDER BY 1;
END;
$BODY$ LANGUAGE plpgsql;
致电:
SELECT * FROM f_forcast();
我已经简化了很多,删除了计算中的一些多余步骤。
该解决方案采用了多种先进技术。您需要了解使用PostgreSQL的方式。
I have simplified quite a bit, removing some redundant steps in the calculation.
The solution employs a variety of advanced techniques. You need to know your way around PostgreSQL to work with this.
这篇关于递归SQL语句(PostgreSQL 9.1.4)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!