Power BI中的多元线性回归 [英] Multiple Linear Regression in Power BI

查看:280
本文介绍了Power BI中的多元线性回归的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有一组收益,并且我想计算它相对于不同市场指数的beta值.为了有一个具体的例子,让我们在名为Returns的表中使用以下数据集:

Suppose I have a set of returns and I want to compute its beta values versus different market indices. Let's use the following set of data in a table named Returns for the sake of having a concrete example:

  Date       Equity  Duration  Credit  Manager
-----------------------------------------------
01/31/2017   2.907%   0.226%   1.240%   1.78%
02/28/2017   2.513%   0.493%   1.120%   3.88%
03/31/2017   1.346%  -0.046%  -0.250%   0.13%
04/30/2017   1.612%   0.695%   0.620%   1.04%
05/31/2017   2.209%   0.653%   0.480%   1.40%
06/30/2017   0.796%  -0.162%   0.350%   0.63%
07/31/2017   2.733%   0.167%   0.830%   2.06%
08/31/2017   0.401%   1.083%  -0.670%   0.29%
09/30/2017   1.880%  -0.857%   1.430%   2.04%
10/31/2017   2.151%  -0.121%   0.510%   2.33%
11/30/2017   2.020%  -0.137%  -0.020%   3.06%
12/31/2017   1.454%   0.309%   0.230%   1.28%

现在在Excel中,我只能使用LINEST函数来获取beta值:

Now in Excel, I can just use the LINEST function to get the beta values:

= LINEST(Returns[Manager], Returns[[Equity]:[Credit]], TRUE, TRUE)

它吐出一个看起来像这样的数组:

It spits out an array that looks like this:

0.077250253 -0.184974002  0.961578127 -0.001063971
0.707796954  0.60202895   0.540811546  0.008257129
0.50202386   0.009166729  #N/A         #N/A
2.688342242  8            #N/A         #N/A
0.000677695  0.000672231  #N/A         #N/A

测试版位于第一行,使用它们可以得出以下线性估算值:

The betas are in the top row and using them gives me the following linear estimate:

Manager = 0.962 * Equity - 0.185 * Duration + 0.077 * Credit - 0.001

问题是如何使用DAX(最好无需编写自定义R脚本)在Power BI中获得这些值?

The question is how can I get these values in Power BI using DAX (preferably without having to write a custom R script)?

对于针对简单线性回归的一栏,我可以回到数学定义并写最小二乘实现类似于这篇文章.

For simple linear regression against one column, I can go back to the mathematical definition and write a least squares implementation similar to the one given in this post.

但是,当涉及更多列时(我需要最多可以处理12列,但数量不一定总是相同),这很快就会变得很混乱,我希望有一种更好的方法.

However, when more columns become involved (I need to be able to do up to 12 columns, but not always the same number), this gets messy really quickly and I'm hoping there's a better way.

推荐答案

要点:

DAX不是要走的路.使用Home > Edit Queries,然后使用Transform > Run R Script.插入以下R代码段,以使用表中的所有可用变量运行回归分析:

DAX is not the way to go. Use Home > Edit Queries and then Transform > Run R Script. Insert the following R snippet to run a regression analysis using all available variables in a table:

model <- lm(Manager ~ . , dataset)
df<- data.frame(coef(model))
names(df)[names(df)=="coef.model."] <- "coefficients"
df['variables'] <- row.names(df)

Manager编辑为其他任何可用的变量名称,以更改因变量.

Edit Manager to any of the other available variable names to change the dependent variable.

详细信息:

好问题!为什么Microsoft没有引入更灵活的解决方案是我无法理解的.但是目前,如果在Power BI中不使用R,您将找不到非常好的方法.

Good question! Why Microsoft has not introduced more flexible solutions is beyond my understanding. But at the time being, you won't be able to find very good approaches without using R in Power BI.

因此,我建议的方法将忽略您的以下要求:

My suggested approach will therefore ignore your request regarding:

问题是如何使用DAX在Power BI中获得这些值 (最好不必编写自定义R脚本)?

The question is how can I get these values in Power BI using DAX (preferably without having to write a custom R script)?

但是我的答案将满足您的以下要求:

My answer will however meet your requirements regarding:

一个好的答案应该概括为3列以上(可能是 使用索引作为值在未透视的数据表上工作 比列标题).

A good answer should generalize to more than 3 columns (probably by working on an unpivoted data table with the indices as values rather than column headers).

我们在这里:

我在使用逗号作为小数点分隔符的系统上,因此我将使用以下内容作为数据源(如果将数字直接复制到Power BI中,则将不保持列分隔.您首先将其粘贴到Excel中,然后再次复制,然后将其粘贴到Power BI中,这些列就可以了)

I'm on a system using comma as a decimal separator, so I'm going to be using the following as the data source (If you copy the numbers directly into Power BI, the column separation will not be maintained. If you first paste it into Excel, copy it again and THEN paste it into Power BI the columns will be fine):

Date    Equity  Duration    Credit  Manager
31.01.2017  2,907   0,226   1,24    1,78
28.02.2017  2,513   0,493   1,12    3,88
31.03.2017  1,346   -0,046  -0,25   0,13
30.04.2017  1,612   0,695   0,62    1,04
31.05.2017  2,209   0,653   0,48    1,4
30.06.2017  0,796   -0,162  0,35    0,63
31.07.2017  2,733   0,167   0,83    2,06
31.08.2017  0,401   1,083   -0,67   0,29
30.09.2017  1,88    -0,857  1,43    2,04
31.10.2017  2,151   -0,121  0,51    2,33
30.11.2017  2,02    -0,137  -0,02   3,06
31.12.2017  1,454   0,309   0,23    1,28

从Power BI从头开始(出于再现目的),我正在使用Enter Data插入数据:

Starting from scratch in Power BI (for reproducibility purposes) I'm inserting the data using Enter Data:

现在,转到Edit Queries > Edit Queries并检查您是否具有以下内容:

Now, go to Edit Queries > Edit Queries and check that you have this:

为了在分析中包含的列数方面保持灵活性,我发现最好删除日期列.这不会影响您的回归结果.只需右键单击日期"列,然后选择Remove:

In order to maintain flexibility with regards to the number of columns to include in your analysis, I find it is best to remove the Date Column. This will not have an impact on your regression results. Simply right-click the Date column and select Remove:

请注意,这将在>下添加一个新步骤:

Notice that this will add a new step under Query Settings > Applied Steps>:

这是您将能够编辑我们将要使用的几行R代码的地方.现在,转到Transform > Run R Script打开此窗口:

And this is where you are going to be able to edit the few lines of R code we're going to use. Now, go to Transform > Run R Script to open this window:

注意行# 'dataset' holds the input data for this script.值得庆幸的是,您的问题仅涉及一个输入表,因此事情不会变得太复杂(对于多个输入表,请查看

Notice the line # 'dataset' holds the input data for this script. Thankfully, your question is only about ONE input table, so things aren't going to get too complicated (for multiple input tables check out this post). The dataset variable is a variable of the form data.frame in R and is a good (the only..) starting point for further analysis.

插入以下脚本:

model <- lm(Manager ~ . , dataset)
df<- data.frame(coef(model))
names(df)[names(df)=="coef.model."] <- "coefficients"
df['variables'] <- row.names(df)

单击OK,如果一切顺利,您应该这样做:

Click OK, and if all goes well you should end up with this:

单击Table,您将获得以下信息:

Click Table, and you'll get this:

Applied Steps下,您将发现已插入Run R Script步骤.单击右侧的星号(齿轮?)对其进行编辑,或单击df格式化输出表.

Under Applied Steps you'll se that a Run R Script step has been inserted. Click the star (gear ?) on the right to edit it, or click on df to format the output table.

就这样!至少对于 Edit Queries (编辑查询)部分.

This is it! For the Edit Queries part at least.

单击Home > Close & Apply返回Power BI报表部分,并确认您在Visualizations > Fields下有一个新表:

Click Home > Close & Apply to get back to Power BI Report section and verfiy that you have a new table under Visualizations > Fields:

插入表格或矩阵并激活系数和变量以获取此信息:

Insert a Table or Matrix and activate Coefficients and Variables to get this:

我希望这就是您想要的!

I hope this is what you were looking for!

现在有关R脚本的一些详细信息:

Now for some details about the R script:

只要有可能,我就会避免使用许多不同的R库.这样,您就可以减少依赖问题的风险.

As long as it's possible, I would avoid using numerous different R libraries. This way you'll reduce the risk of dependency issues.

函数lm()处理回归分析.在解释变量的数量上获得所需灵活性的关键在于Manager ~ . , dataset部分.这只是说要对数据帧dataset中的Manager变量运行回归分析,并将所有剩余的列~ .用作解释变量. coef(model)部分从估计的模型中提取系数值.结果是一个数据框,其中变量名作为行名.最后一行只是将这些名称添加到数据框本身中.

The function lm() handles the regression analysis. The key to obtain the required flexibilty with regards to the number of explanatory variables lies in the Manager ~ . , dataset part. This simply says to run a regression analysis on the Manager variable in the dataframe dataset, and use all remaining columns ~ . as explanatory variables. The coef(model) part extracts the coefficient values from the estimated model. The result is a dataframe with the variable names as row names. The last line simply adds these names to the dataframe itself.

这篇关于Power BI中的多元线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆