Replicate logistic regression model from pyspark in scikit-learn
Problem description
Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn seem to yield different results given their default parameter values.
I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) with the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
It appears to me that the two model implementations (in pyspark and scikit-learn) do not expose the same parameters, so I can't simply set the parameters in scikit-learn to match those in pyspark. Is there a way to make both models match in their default configuration?
Parameters of the scikit-learn model (default parameters):
`LogisticRegression(
C=1.0,
class_weight=None,
dual=False,
fit_intercept=True,
intercept_scaling=1,
max_iter=100,
multi_class='ovr',
n_jobs=1,
penalty='l2',
random_state=None,
solver='liblinear',
tol=0.0001,
verbose=0,
warm_start=False)`
Parameters of the pyspark model (default parameters):
LogisticRegression(self,
featuresCol="features",
labelCol="label",
predictionCol="prediction",
maxIter=100,
regParam=0.0,
elasticNetParam=0.0,
tol=1e-6,
fitIntercept=True,
threshold=0.5,
thresholds=None,
probabilityCol="probability",
rawPredictionCol="rawPrediction",
standardization=True,
weightCol=None,
aggregationDepth=2,
family="auto")
Thank you very much!
Recommended answer
By now I have figured out that, as indicated by the parameter standardization=True, pyspark standardizes the data within the model, whereas scikit-learn does not. Applying preprocessing.scale to the features before fitting the scikit-learn model gave me closely matching results for both models.
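The scikit-learn side of this approach can be sketched roughly as follows. This is a minimal illustration, not the exact code used: the toy data (`X`, `y`) is hypothetical and stands in for the features and labels that were fed to pyspark, and the model is left at its defaults as in the question.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with columns on very different scales;
# replace with the same features/labels used in the pyspark model.
rng = np.random.RandomState(0)
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(200, 3))
y = (X[:, 0] + (X[:, 1] - 50.0) / 20.0 + rng.normal(size=200) > 0).astype(int)

# Standardize each feature column to zero mean and unit variance,
# since scikit-learn's LogisticRegression does not do this itself.
X_scaled = preprocessing.scale(X)

# Fit the scikit-learn model with its default parameters on the scaled data.
model = LogisticRegression()
model.fit(X_scaled, y)

print(model.coef_, model.intercept_)
```

Note that the parameter lists in the question also differ in regularization (C=1.0 with an L2 penalty in scikit-learn versus regParam=0.0 in pyspark), which may account for any remaining gap between the two results.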