Optimize changing variables to get max Pearson's correlation coefficient for multiple columns
Problem description
Edit:
If I have a pandas DataFrame with five columns (Col1, Col2, Col3, Col4, Col5), I need to get the maximum Pearson correlation coefficient between (Col2, Col3), (Col2, Col4), and (Col2, Col5), taking the values in Col1 into account.
The modified values of Col2 are obtained with the following formula:
df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']
where B is the changing variable (a single value) chosen to maximize the Pearson correlation coefficient between (new values of Col2, Col3), (new values of Col2, Col4), and (new values of Col2, Col5).
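Written out, the problem might be stated as follows. This is only a sketch of the setup: the symbols $x_i$ (Col2), $w_i$ (Col1), $y_{k,i}$ (Col3, Col4, Col5 for $k = 3, 4, 5$) and $\tilde{x}_i$ (the modified Col2) are notation introduced here, not taken from the question.

```latex
\tilde{x}_i(B) = x_i \, w_i^{B},
\qquad
r_k(B) =
\frac{\sum_i \bigl(\tilde{x}_i(B) - \bar{\tilde{x}}(B)\bigr)\,\bigl(y_{k,i} - \bar{y}_k\bigr)}
     {\sqrt{\sum_i \bigl(\tilde{x}_i(B) - \bar{\tilde{x}}(B)\bigr)^2}\;
      \sqrt{\sum_i \bigl(y_{k,i} - \bar{y}_k\bigr)^2}},
\qquad k \in \{3, 4, 5\}
```

With B = 0 every weight $w_i^{0}$ equals 1 (NumPy also evaluates 0 ** 0 as 1), so $r_k(0)$ is the correlation of the unmodified Col2 with column $k$.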
Update:
The table above contains the five columns mentioned earlier; the correlation coefficients between (Col2, Col3), (Col2, Col4), and (Col2, Col5) are shown below the table.
I need to change the values of Col2 based on the two equations mentioned above, where the changing value is B.
So the question is: how do I get the best value of B, that is, one that gives new correlation coefficients greater than or equal to their old counterparts?
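One way to explore this condition before running an optimizer is to scan a range of B values and keep those for which none of the three coefficients falls below its old value. The sketch below assumes the sample data from the question; the helper name `correlations`, the search range 0 to 3, and the step of 0.05 are arbitrary choices made here for illustration.

```python
import io

import numpy as np
import pandas as pd

# Sample data copied from the question.
CSV = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""

df = pd.read_csv(io.StringIO(CSV))
targets = ["Col3", "Col4", "Col5"]


def correlations(df, b):
    """Pearson correlation of Col2 * Col1**b with each target column."""
    # Work on a copy of the values so the DataFrame itself is not modified.
    modified = df["Col2"] * np.power(df["Col1"].astype(float), b)
    return pd.Series({t: modified.corr(df[t]) for t in targets})


# b = 0 leaves Col2 unchanged (Col1**0 == 1), so it yields the old coefficients.
baseline = correlations(df, 0.0)

# Hypothetical search grid: 61 evenly spaced values from 0 to 3.
grid = np.linspace(0.0, 3.0, 61)
scan = pd.DataFrame({b: correlations(df, b) for b in grid}).T
scan.index.name = "B"

# Keep only the B values for which no coefficient drops below its baseline.
mask = (scan.to_numpy() >= baseline.to_numpy() - 1e-12).all(axis=1)
no_worse = scan[mask]

print(baseline)
print(no_worse)
```

The row for B = 0 always passes the filter, since it reproduces the old coefficients; any other surviving rows are candidates that improve on or match all three at once.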
Update 2:
Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
Not extremely elegant, but it works; feel free to make this more generic:
import pandas as pd
from scipy.optimize import minimize


def minimize_me(b, df):
    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b)


# read your dataframe from somewhere, e.g. csv
df = pd.read_clipboard(sep=',')

# B is constrained to be non-negative for now
bnds = [(0, None)]

res = minimize(minimize_me, (1), args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])
    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")
This prints:
0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(col2, col3)
For now I read the data from the clipboard; replace that with your .csv file. You should also avoid hardcoding the columns; the code above is for demonstration purposes only, so that you can see how to set up the optimization problem itself.
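A version without hardcoded columns could look roughly like the sketch below. It is not part of the original answer: the function names `neg_corr_sum` and `best_exponent` and their parameters are made up for illustration, the objective is assumed to be the sum of the correlations over the chosen target columns, and the question's sample data is embedded in place of a .csv file.

```python
import io

import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Sample data copied from the question (stands in for your .csv file).
CSV = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""


def neg_corr_sum(b, df, base, weight, targets):
    """Negative sum of Pearson correlations (negated because we minimize)."""
    # scipy passes b as a length-1 array; pull out the scalar exponent.
    exponent = float(np.ravel(b)[0])
    modified = df[base] * np.power(df[weight].astype(float), exponent)
    return -sum(modified.corr(df[t]) for t in targets)


def best_exponent(df, base, weight, targets, b0=1.0):
    """Search for a non-negative exponent, starting from the guess b0."""
    return minimize(
        neg_corr_sum,
        x0=[b0],
        args=(df, base, weight, targets),
        bounds=[(0, None)],
    )


df = pd.read_csv(io.StringIO(CSV))
res = best_exponent(df, base="Col2", weight="Col1",
                    targets=["Col3", "Col4", "Col5"])

if res.success:
    print("B:", res.x[0])
    print("sum of correlations:", -res.fun)
else:
    print("Optimization failed:", res.message)
```

Passing a single-element list as `targets` gives the one-column case again. The result is a local optimum found from the initial guess, so trying a few different `b0` values is a cheap sanity check.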
If you are interested in the sum, you can use (rest of the code unmodified):
def minimize_me(b, df):
    col_mod = df['Col2'] * df['Col1'] ** b

    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))
This prints:
1.0452394748131613
2.3428368479642137