Optimize changing variables to get max Pearson's correlation coefficient for multiple columns

Views: 172 Published: 2020/10/10 1:42:18 Tags: python scipy correlation minimization scipy-optimize


Question

Edit:

I have a pandas DataFrame with five columns, Col1, Col2, Col3, Col4, and Col5. I need to get the maximum Pearson's correlation coefficient for the pairs (Col2, Col3), (Col2, Col4), and (Col2, Col5), taking the values in Col1 into account.

The modified values of Col2 are obtained with the following formula:

df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']

where B is the changing variable (a single value) chosen to maximize Pearson's correlation coefficient between the new values of Col2 and each of Col3, Col4, and Col5.
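As a minimal sketch of the formula above: the helper name `modified_col2` and the three-row frame are made up for illustration and are not part of the original post. Unlike the two assignment lines, this version returns a new Series and leaves Col1 and Col2 untouched, so several values of B can be tried on the same DataFrame.

```python
import pandas as pd


def modified_col2(df, b):
    """Return Col2 * Col1**b as a new Series; df itself is not modified."""
    return df["Col2"] * df["Col1"] ** b


# Tiny made-up frame, only to illustrate the formula.
df = pd.DataFrame({"Col1": [2, 4, 6], "Col2": [0.05, 0.08, 0.09]})

print(modified_col2(df, 0.0).tolist())  # B = 0 leaves Col2 unchanged
print(modified_col2(df, 1.0).tolist())  # B = 1 multiplies Col2 by Col1
```

Keeping the original columns intact matters here because the optimizer has to evaluate the formula for many candidate values of B.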

Update:

The table above contains the five columns described earlier; the correlation coefficients for (Col2, Col3), (Col2, Col4), and (Col2, Col5) are shown below the table.

I need to change the values of Col2 using the two equations mentioned above, where the changing value is B.

So the question is: how do I find the best value of B, one that gives new correlation coefficients greater than or equal to their old counterparts?

Update 2:

Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572

Answer

Not extremely elegant, but it works; feel free to make this more generic:
import pandas as pd
from scipy.optimize import minimize


def minimize_me(b, df):

    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b )

# read your dataframe from somewhere, e.g. a CSV file
df = pd.read_clipboard(sep=',')

# B is greater than 0 for now
bnds = [(0, None)]

res = minimize(minimize_me, (1), args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])

    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")

This prints:
0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(col2, col3)

I read the data from the clipboard here; replace that with your .csv file. You should also avoid hardcoding the column names; the code above is only for demonstration purposes, to show how to set up the optimization problem itself.
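One possible way to drop the hardcoded columns is sketched below. The function `best_exponent` and its parameter names are hypothetical, not part of the original answer, and the `io.StringIO` block only stands in for a call such as `pd.read_csv("your_file.csv")` on your own file. The bound B >= 0 and the initial guess of 1 are carried over from the code above.

```python
import io

import numpy as np
import pandas as pd
from scipy.optimize import minimize


def best_exponent(df, weight_col, base_col, target_col, b0=1.0):
    """Maximize corr(target_col, base_col * weight_col**b) over b >= 0.

    Returns (b, correlation, success flag reported by the optimizer).
    """
    def neg_corr(b):
        # minimize() passes a length-1 array; convert it to a plain float
        b = float(np.ravel(b)[0])
        modified = df[base_col] * df[weight_col] ** b
        # negate because minimize() minimizes and we want a maximum
        return -df[target_col].corr(modified)

    res = minimize(neg_corr, x0=[b0], bounds=[(0, None)])
    return float(res.x[0]), float(-res.fun), bool(res.success)


# Stand-in for pd.read_csv("your_file.csv"); the rows are the data from the question.
csv_text = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""
df = pd.read_csv(io.StringIO(csv_text))

# Optimize B separately for each target column.
for target in ["Col3", "Col4", "Col5"]:
    b, r, ok = best_exponent(df, "Col1", "Col2", target)
    print(target, b, r, ok)
```

Each target column gets its own B in this sketch; if a single B has to serve all three columns, a combined objective is needed instead.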
If you are interested in the sum of the three correlations, you can use the following (the rest of the code is unchanged):
def minimize_me(b, df):

    col_mod = df['Col2'] * df['Col1'] ** b

    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))

This prints:
1.0452394748131613
2.3428368479642137
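Maximizing the sum does not by itself guarantee that each individual coefficient is at least as large as its original value, which is what the question asks for. A rough way to check this, assuming the data from the question and the B printed above, is to compare the coefficients at B = 0 (where Col1 ** 0 is 1, so Col2 is unchanged) with those at the optimized B. The helper `correlations` and the `improved` column are hypothetical names used only in this sketch.

```python
import io

import pandas as pd

csv_text = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""
df = pd.read_csv(io.StringIO(csv_text))
targets = ["Col3", "Col4", "Col5"]


def correlations(df, b):
    """Pearson correlation of each target column with Col2 * Col1**b."""
    col_mod = df["Col2"] * df["Col1"] ** b
    return df[targets].corrwith(col_mod)


b_opt = 1.0452394748131613  # the B reported for the summed objective

comparison = pd.DataFrame({
    "old": correlations(df, 0.0),    # B = 0 reproduces the original Col2
    "new": correlations(df, b_opt),
})
# True where the new coefficient is at least as large as the old one
comparison["improved"] = comparison["new"] >= comparison["old"]
print(comparison)
```

If any row shows `improved` as False, the summed objective would need to be replaced by a constrained formulation, which is beyond what the answer above covers.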

