How to perform linear regression in BigQuery?

Question

BigQuery has some statistical aggregation functions such as STDDEV(X) and CORR(X, Y), but it doesn't offer functions to directly perform linear regression.

How can one compute a linear regression using the functions that do exist?

Recommended answer

Editor's edit: Please see next answer, linear regression is now natively supported in BigQuery. --Fh
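
The native support mentioned in the note presumably refers to BigQuery ML. A minimal sketch of that route, assuming you have a dataset you can write to (the names `mydataset` and `natality_linreg` are hypothetical) and that the natality sample is reachable as `bigquery-public-data.samples.natality`:

```sql
-- Sketch only: fit weight_pounds as a linear function of gestation_weeks.
-- `mydataset.natality_linreg` is a hypothetical model name.
CREATE OR REPLACE MODEL `mydataset.natality_linreg`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['weight_pounds']
) AS
SELECT
  gestation_weeks,
  weight_pounds
FROM `bigquery-public-data.samples.natality`
WHERE gestation_weeks IS NOT NULL
  AND weight_pounds IS NOT NULL;

-- Inspect the fitted coefficient and intercept.
SELECT *
FROM ML.WEIGHTS(MODEL `mydataset.natality_linreg`);
```

The fitted weights need not match a closed-form calculation exactly, since BigQuery ML may hold out part of the data for evaluation and applies its own training options.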

The following query performs a linear regression using calculations that are numerically stable and easy to adapt to any input table. Using the built-in function CORR, it produces the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT, along with the Pearson correlation coefficient.
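
For reference, the closed-form least-squares estimates that the query evaluates can be written as follows. The symbols are notation introduced here, not part of the original answer: r is the Pearson correlation (CORR), sigma_X and sigma_Y are the population standard deviations (STDDEV_POP), and N is the row count per bucket.

```latex
\begin{aligned}
\text{SLOPE} &= r \cdot \frac{\sigma_Y}{\sigma_X}, \\
\text{INTERCEPT} &= \frac{\sum_i Y_i - \text{SLOPE} \cdot \sum_i X_i}{N}
                 = \bar{Y} - \text{SLOPE} \cdot \bar{X}.
\end{aligned}
```

Because only the ratio of the two standard deviations enters the slope, the population and sample versions would give the same result.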

As an example, we use the public natality dataset to compute birth weight as a linear function of the duration of pregnancy, broken down by state. You could write this more compactly, but we use several layers of subqueries to highlight how the pieces go together. To apply this to another dataset, you just need to replace the innermost query.

SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
           CORRELATION
    FROM (
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               STDDEV_POP(X) AS STDDEV_OF_X,
               STDDEV_POP(Y) AS STDDEV_OF_Y,
               CORR(X,Y) AS CORRELATION
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));
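
The query above uses the legacy SQL bracket syntax for the table name. A sketch of the same calculation in GoogleSQL (standard SQL) might look like the following; it assumes the natality sample is reachable as `bigquery-public-data.samples.natality`, and the lower-case aliases and CTE names are my own.

```sql
-- Sketch: the same regression in GoogleSQL (standard SQL).
-- Assumption: the table is available as `bigquery-public-data.samples.natality`.
WITH points AS (
  -- Replace this CTE to apply the regression to another dataset.
  SELECT
    state AS bucket,
    gestation_weeks AS x,
    weight_pounds AS y
  FROM `bigquery-public-data.samples.natality`
  WHERE state IS NOT NULL
    AND gestation_weeks IS NOT NULL
    AND weight_pounds IS NOT NULL
),
stats AS (
  SELECT
    bucket,
    COUNT(*) AS n,
    SUM(x) AS sum_of_x,
    SUM(y) AS sum_of_y,
    -- SAFE_DIVIDE returns NULL instead of raising an error
    -- when x is constant within a bucket (zero standard deviation).
    SAFE_DIVIDE(CORR(x, y) * STDDEV_POP(y), STDDEV_POP(x)) AS slope,
    CORR(x, y) AS correlation
  FROM points
  GROUP BY bucket
)
SELECT
  bucket,
  slope,
  (sum_of_y - slope * sum_of_x) / n AS intercept,
  correlation
FROM stats
ORDER BY bucket;
```

The CTEs play the same role as the nested subqueries: `points` is the innermost query to swap out, `stats` holds the per-bucket aggregates, and the final SELECT derives the intercept.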

Using the STDDEV_POP and CORR functions makes this query more numerically stable than summing up products of X and Y and then taking differences and dividing. If you run both approaches on a well-behaved dataset, however, you can verify that they produce the same results to high accuracy.
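
One way to run that check is to compute the slope both ways side by side. The sketch below is written in GoogleSQL and assumes the table is reachable as `bigquery-public-data.samples.natality`; the column names `slope_via_corr` and `slope_via_sums` are illustrative.

```sql
-- Sketch: per-state slope computed two ways, for comparison.
WITH points AS (
  SELECT
    state AS bucket,
    CAST(gestation_weeks AS FLOAT64) AS x,  -- cast so all sums are floating point
    weight_pounds AS y
  FROM `bigquery-public-data.samples.natality`
  WHERE state IS NOT NULL
    AND gestation_weeks IS NOT NULL
    AND weight_pounds IS NOT NULL
)
SELECT
  bucket,
  -- Approach used in the answer: correlation times the ratio of standard deviations.
  SAFE_DIVIDE(CORR(x, y) * STDDEV_POP(y), STDDEV_POP(x)) AS slope_via_corr,
  -- Textbook sums-of-products form: (N*Sxy - Sx*Sy) / (N*Sxx - Sx^2).
  SAFE_DIVIDE(
    COUNT(*) * SUM(x * y) - SUM(x) * SUM(y),
    COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)
  ) AS slope_via_sums
FROM points
GROUP BY bucket
ORDER BY bucket;
```

The sums-of-products form subtracts two large, nearly equal quantities, which is where precision can be lost on less well-behaved data.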
