SQL Server中“日期"列的线性回归分析 [英] Linear Regression analysis for Date column in SQL Server

查看:70
本文介绍了SQL Server中“日期"列的线性回归分析的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下代码块,使用线性回归(最小二乘法)来计算趋势线的公式.只是找到X和Y轴的R平方和相关系数值.

I have the following block of code that calculates the formula for a trend line using linear regression (method of least squares). It just find the R-Squared and coefficient of correlation value for X and Y axis.

如果X和Y轴为int且为float,则将计算确切值.

This will calculate the exact value if X and Y axis are int and float.

CREATE FUNCTION [dbo].[LinearReqression] (@Data AS XML)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x = n.value('@x', 'float'),
                y = n.value('@y', 'float')
        FROM @Data.nodes('/r/n') v(n)
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)

用法:

DECLARE @DataTable TABLE (
    SourceID    INT,
    x           Date,
    y           FLOAT
) ;
INSERT INTO @DataTable ( SourceID, x, y )
SELECT ID = 0, x = 1.2, y = 1.0
UNION ALL SELECT 1, 1.6, 1
UNION ALL SELECT 2, 2.0, 1.5
UNION ALL SELECT 3, 2.0, 1.75
UNION ALL SELECT 4, 2.1, 1.85
UNION ALL SELECT 5, 2.1, 2
UNION ALL SELECT 6, 2.2, 3
UNION ALL SELECT 7, 2.2, 3
UNION ALL SELECT 8, 2.3, 3.5
UNION ALL SELECT 9, 2.4, 4
UNION ALL SELECT 10, 2.5, 4
UNION ALL SELECT 11, 3, 4.5 ;

-- Create and view XML data array
DECLARE @DataXML XML ;
SET @DataXML = (
    SELECT  -- FLOAT values are formatted in XML like "1.000000000000000e+000", increasing the character count
            -- Converting them to VARCHAR first keeps the XML small without sacrificing precision
            -- They are unpacked as FLOAT in the function either way
            [@x] = CAST(x AS VARCHAR(20)), 
            [@y] = CAST(y AS VARCHAR(20))
    FROM @DataTable
    FOR XML PATH('n'), ROOT('r') ) ;

SELECT @DataXML ;

-- Get the results
SELECT * FROM dbo.LinearReqression (@DataXML) ;

就我而言,X轴也可能是日期列?那么如何使用日期列计算相​​同的回归分析?

In my case X axis may be Date column also? So how can I calculate same regression analysis with date columns?

推荐答案

简单的答案是:计算日期的趋势线与计算浮点型趋势线几乎相同.

Short answer is: calculating trend line for dates is pretty much the same as calculating trend line for floats.

对于日期,您可以选择一些开始日期,并将开始日期和日期之间的天数用作X.

For dates you can choose some starting date and use number of days between the starting date and your dates as an X.

我没有检查您的函数本身,我认为那里的公式是正确的.

I didn't check your function itself and I assume that formulas there are correct.

此外,我不明白为什么要从表中生成XML并将其解析回函数内部的表中.这是相当低效的.您只需传递表即可.

Also, I don't understand why you generate XML out of the table and parse it back into the table inside the function. It is rather inefficient. You can simply pass the table.

我使用您的函数进行了两种变体:用于处理浮点数和用于处理日期. 在此示例中,我使用的是SQL Server 2008.

I used your function to make two variants: for processing floats and for processing dates. I'm using SQL Server 2008 for this example.

首先创建一个用户定义的表类型,所以我们可以将表传递给函数:

At first create a user-defined table type, so we could pass a table into the function:

CREATE TYPE [dbo].[FloatRegressionDataTableType] AS TABLE(
    [x] [float] NOT NULL,
    [y] [float] NOT NULL
)
GO

然后创建接受该表的函数:

Then create the function that accepts such table:

CREATE FUNCTION [dbo].[LinearRegressionFloat] (@ParamData dbo.FloatRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO

非常类似地,为带有日期的表格创建类型:

Very similarly, create a type for table with dates:

CREATE TYPE [dbo].[DateRegressionDataTableType] AS TABLE(
    [x] [date] NOT NULL,
    [y] [float] NOT NULL
)
GO

并创建一个接受此类表的函数.对于每个给定日期,它使用DATEDIFF计算2001-01-01和给定日期x之间的天数,然后将结果强制转换为浮点数,以确保其余计算正确.您可以尝试将演员表移至浮动状态,然后看到不同的结果.您可以选择其他任何开始日期,而不必是2001-01-01.

And create a function that accepts such table. For each given date it calculates the number of days between 2001-01-01 and the given date x using DATEDIFF and then casts the result to float to make sure that the rest of calculations is correct. You can try to remove the cast to float and you'll see the different result. You can choose any other starting date, it doesn't have to be 2001-01-01.

CREATE FUNCTION [dbo].[LinearRegressionDate] (@ParamData dbo.DateRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  CAST(DATEDIFF(day, '2001-01-01', x) AS float) AS x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO

这是测试功能的方法:

-- test float data
DECLARE @FloatDataTable [dbo].[FloatRegressionDataTableType];

INSERT INTO @FloatDataTable (x, y)
VALUES
(1.2, 1.0)
,(1.6, 1)
,(2.0, 1.5)
,(2.0, 1.75)
,(2.1, 1.85)
,(2.1, 2)
,(2.2, 3)
,(2.2, 3)
,(2.3, 3.5)
,(2.4, 4)
,(2.5, 4)
,(3, 4.5);

SELECT * FROM dbo.LinearRegressionFloat(@FloatDataTable);


-- test date data
DECLARE @DateDataTable [dbo].[DateRegressionDataTableType];

INSERT INTO @DateDataTable (x, y)
VALUES
 ('2001-01-13', 1.0)
,('2001-01-17', 1)
,('2001-01-21', 1.5)
,('2001-01-21', 1.75)
,('2001-01-22', 1.85)
,('2001-01-22', 2)
,('2001-01-23', 3)
,('2001-01-23', 3)
,('2001-01-24', 3.5)
,('2001-01-25', 4)
,('2001-01-26', 4)
,('2001-01-31', 4.5);

SELECT * FROM dbo.LinearRegressionDate(@DateDataTable);

这是两个结果集:

r_squared            Alpha                Beta
----------------------------------------------------------
0.798224907472009    -2.66524390243902    2.46417682926829


r_squared            Alpha                Beta
----------------------------------------------------------
0.79822490747201     -2.66524390243902    0.246417682926829

这篇关于SQL Server中“日期"列的线性回归分析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆