Is there a way to suitably adjust this sklearn logistic regression function to account for multiple independent variables and fixed effects?
Question
I would like to adjust the LogitRegression function below so that it can include additional independent variables and fixed effects.
The following code is adapted from the answer provided here: how to use sklearn when target variable is a proportion
from sklearn.linear_model import LinearRegression
from random import choice
import numpy as np
import pandas as pd


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # map the proportion p in (0, 1) onto the real line with the logit
        # transform, then fit an ordinary linear regression on the result
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        # invert the transform (sigmoid) so predictions land back in (0, 1)
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':

    ### 1. Original version with a single independent variable

    # generate example data
    np.random.seed(42)
    n = 100

    # orig version provided in the link - single random independent variable
    x = np.random.randn(n).reshape(-1, 1)

    # defining the dependent variable (a proportion between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    # applying the model - this works
    model = LogitRegression()
    model.fit(x, p)

    ### 2. Adding additional independent variables and a fixed effects variable

    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)

    # a fixed effects variable
    cats = [choice(["France", "Norway", "Ireland"]) for _ in range(n)]

    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})

    # adding the fixed effects country columns
    df = pd.concat([df, pd.get_dummies(df.countries)], axis=1)
    print(df)

    # ideally I would like to use the independent variables x1, x2, x3 and the
    # fixed effects column, countries, from the above df, but I'm not sure how
    # best to edit the LogitRegression class to account for this. The dependent
    # variable is a proportion.
    model = LogitRegression()
    model.fit(x, p)
I would like the predicted output to be a proportion between 0 and 1. I previously tried sklearn's plain linear regression approach, but that produced predictions outside the expected range. I also looked into using the statsmodels OLS function, but although I could include multiple independent variables, I could not find a way to include fixed effects.
Thanks in advance for any help on this, or please let me know if there is another suitable alternative approach.
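To illustrate the out-of-range problem described above, here is a minimal sketch (reusing the same class and the same synthetic data generation as the code above) that compares plain LinearRegression against the logit-transform version on inputs far outside the training range:

```python
# Minimal sketch: plain LinearRegression fitted on a proportion can
# extrapolate outside [0, 1]; the logit-transform version cannot.
import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):
    def fit(self, x, p):
        p = np.asarray(p)
        return super().fit(x, np.log(p / (1 - p)))

    def predict(self, x):
        return 1 / (np.exp(-super().predict(x)) + 1)


np.random.seed(42)
n = 100
x = np.random.randn(n).reshape(-1, 1)
p = np.tanh(x + 0.1 * np.random.randn(n).reshape(-1, 1)) / 2 + 0.5

# points far outside the range of the training data
x_extreme = np.array([[-10.0], [10.0]])

plain = LinearRegression().fit(x, p).predict(x_extreme)
logit = LogitRegression().fit(x, p).predict(x_extreme)

print(plain.ravel())  # the straight line shoots below 0 / above 1
print(logit.ravel())  # the sigmoid keeps these strictly inside (0, 1)
```

The sigmoid in `predict` guarantees the bound regardless of input, which is exactly the property wanted for a proportional target.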
Answer
I successfully solved this with the small adjustment below, passing the independent and fixed effects variables to the function via the dataframe (writing out a simplified example of the question helped me a great deal in finding the answer):
from sklearn.linear_model import LinearRegression
from random import choice
import numpy as np
import pandas as pd


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        # logit-transform the proportion, then fit a linear regression
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        # the sigmoid maps predictions back into (0, 1)
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':

    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)

    # defining the dependent variable (a proportional value between 0 and 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    # creating 3 random independent variables
    x1 = np.random.randn(n)
    x2 = np.random.randn(n)
    x3 = np.random.randn(n)

    # a fixed effects variable
    cats = [choice(["France", "Norway", "Ireland"]) for _ in range(n)]

    # combining these into a dataframe
    df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "countries": cats})

    # adding the fixed effects country columns
    df = pd.concat([df, pd.get_dummies(df.countries)], axis=1)
    print(df)

    # using the independent variables x1, x2, x3 and the fixed effects
    # (country dummy) columns from the above df; the dependent variable
    # is a proportion
    categories = df['countries'].unique()
    x = df.loc[:, np.concatenate((["x1", "x2", "x3"], categories))]
    model = LogitRegression()
    model.fit(x, p)
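A hedged variation on the approach above: with an intercept, the full set of country dummy columns is collinear (the "dummy variable trap"). sklearn's LinearRegression still fits such a rank-deficient design via least squares, but dropping one reference category with `pd.get_dummies(..., drop_first=True)` gives a more conventional parameterisation. The sketch below (with its own synthetic data, not the answer's exact script) fits that way and verifies the predictions stay strictly inside (0, 1):

```python
# Sketch: fit on numeric columns plus dummy-encoded country columns,
# using drop_first=True to avoid the dummy-variable trap, then check
# that predicted proportions remain strictly within (0, 1).
import numpy as np
import pandas as pd
from random import choice, seed
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):
    def fit(self, x, p):
        p = np.asarray(p)
        return super().fit(x, np.log(p / (1 - p)))

    def predict(self, x):
        return 1 / (np.exp(-super().predict(x)) + 1)


np.random.seed(0)
seed(0)
n = 100
df = pd.DataFrame({
    "x1": np.random.randn(n),
    "x2": np.random.randn(n),
    "x3": np.random.randn(n),
    "countries": [choice(["France", "Norway", "Ireland"]) for _ in range(n)],
})
# dependent variable: a proportion strictly between 0 and 1
p = np.tanh(df["x1"] + 0.1 * np.random.randn(n)) / 2 + 0.5

# one dummy column per country *except* the first (reference) category
X = pd.concat(
    [df[["x1", "x2", "x3"]], pd.get_dummies(df["countries"], drop_first=True)],
    axis=1,
)

model = LogitRegression()
model.fit(X, p)
preds = model.predict(X)
print(preds.min(), preds.max())  # both strictly inside (0, 1)
```

The country coefficients are then interpreted relative to the dropped reference category, which is the usual fixed-effects reading.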