How to map features from the output of a VectorAssembler back to the column names in Spark ML?


Problem Description


I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, P-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column. When I try to access Spark's in-built regression summary statistics, they give me a very raw list of numbers for each of these statistics, and there's no way to know which attribute corresponds to which value, which is really difficult to figure out manually with a large number of columns. How do I map these values back to the column names?

For example, my current output looks something like this:

Coefficients: [-187.807832407,-187.058926726,85.1716641376,10595.3352802,-127.258892837,-39.2827730493,-1206.47228704,33.7078197705,99.9956812528]

P-Value: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18589731365614548, 0.275173571416679, 0.0]

t-statistic: [-23.348593508995318, -44.72813283953004, 19.836508234714472, 144.49248881747755, -16.547272230754242, -9.560681351483941, -19.563547400189073, 1.3228378389036228, 1.0912415361190977, 20.383256127350474]

Coefficient Standard Errors: [8.043646497811427, 4.182131353367049, 4.293682291754585, 73.32793120907755, 7.690626652102948, 4.108783841348964, 61.669402913526625, 25.481445101737247, 91.63478289909655, 609.7007361468519]

These numbers mean nothing unless I know which attribute they correspond to. But in my DataFrame I only have one column called "features" which contains rows of sparse Vectors.

This is an even bigger problem when I have one-hot encoded features, because if I have one variable with an encoding of length n, I will get n corresponding coefficients/p-values/t-values, etc.

Solution

As of today, Spark doesn't provide any method that does this for you, so you have to create your own. Let's say your data looks like this:

import random
random.seed(1)

df = sc.parallelize([(
    random.choice([0.0, 1.0]), 
    random.choice(["a", "b", "c"]),
    random.choice(["foo", "bar"]),
    random.randint(0, 100),
    random.random(),
) for _ in range(100)]).toDF(["label", "x1", "x2", "x3", "x4"])

and is processed using the following pipeline:

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

indexers = [
  StringIndexer(inputCol=c, outputCol="{}_idx".format(c)) for c in ["x1", "x2"]]
encoders = [
    OneHotEncoder(
        inputCol=idx.getOutputCol(),
        outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]
assembler = VectorAssembler(
    inputCols=[enc.getOutputCol() for enc in encoders] + ["x3", "x4"],
    outputCol="features")

pipeline = Pipeline(
    stages=indexers + encoders + [assembler, LinearRegression()])
model = pipeline.fit(df)

Get the LinearRegressionModel:

lrm = model.stages[-1]
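Here `model.stages[-1]` works because `LinearRegression` was appended last when the `Pipeline` was built. A slightly more defensive variant looks the fitted stage up by type rather than by position; the sketch below illustrates the idea with stand-in classes, since it depends only on the list of fitted stages:

```python
# Stand-ins for the fitted stage types a PipelineModel would hold
# (in real code these come from pyspark.ml.feature / pyspark.ml.regression).
class StringIndexerModel: pass
class OneHotEncoderModel: pass
class LinearRegressionModel: pass

stages = [StringIndexerModel(), StringIndexerModel(),
          OneHotEncoderModel(), OneHotEncoderModel(),
          LinearRegressionModel()]

# Pick the regression model by type instead of assuming it is last.
lrm = next(s for s in stages if isinstance(s, LinearRegressionModel))
print(type(lrm).__name__)  # LinearRegressionModel
```

This keeps working even if stages are later reordered or extra transformers are inserted before the estimator.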

Transform the data:

transformed = model.transform(df)

Extract and flatten ML attributes:

from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"]) for attr in (chain(*transformed
        .schema[lrm.summary.featuresCol]
        .metadata["ml_attr"]["attrs"].values())))
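The `ml_attr` metadata written by `VectorAssembler` groups attributes by type, typically under `"binary"` (one-hot slots) and `"numeric"` keys, which is why the values are chained together before sorting by slot index. A minimal sketch of that flattening step, using a hand-written dict shaped like the real metadata (the names mirror the pipeline above, but the dict itself is illustrative):

```python
from itertools import chain

# Hypothetical metadata shaped like what VectorAssembler stores in the
# "features" column's schema: attribute groups keyed by attribute type.
ml_attr = {
    "attrs": {
        "binary": [
            {"idx": 0, "name": "x1_idx_enc_a"},
            {"idx": 1, "name": "x1_idx_enc_c"},
            {"idx": 2, "name": "x2_idx_enc_foo"},
        ],
        "numeric": [
            {"idx": 3, "name": "x3"},
            {"idx": 4, "name": "x4"},
        ],
    }
}

# chain(*...values()) flattens the per-type lists into one stream of
# attribute dicts; sorting by idx restores the vector-slot order.
attrs = sorted((a["idx"], a["name"]) for a in chain(*ml_attr["attrs"].values()))
print(attrs)
# [(0, 'x1_idx_enc_a'), (1, 'x1_idx_enc_c'), (2, 'x2_idx_enc_foo'),
#  (3, 'x3'), (4, 'x4')]
```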

and map to the output:

[(name, lrm.summary.pValues[idx]) for idx, name in attrs]

[('x1_idx_enc_a', 0.26400012641279824),
 ('x1_idx_enc_c', 0.06320192217171572),
 ('x2_idx_enc_foo', 0.40447778902400433),
 ('x3', 0.1081883594783335),
 ('x4', 0.4545851609776568)]

[(name, lrm.coefficients[idx]) for idx, name in attrs]

[('x1_idx_enc_a', 0.13874401585637453),
 ('x1_idx_enc_c', 0.23498565469334595),
 ('x2_idx_enc_foo', -0.083558932128022873),
 ('x3', 0.0030186112903237442),
 ('x4', -0.12951394186593695)]
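To get the single summary table the question asks for, the same `attrs` lookup can be applied to every statistic at once. A sketch with placeholder numbers standing in for the fitted model's values (in real code: `list(lrm.coefficients)`, `lrm.summary.pValues`, `lrm.summary.tValues`); note that when an intercept is fit, Spark appends its p-value and t-value as a trailing extra element, which per-slot indexing simply ignores:

```python
# Placeholder statistics; slot indices 0..4 match the attrs extracted above.
attrs = [(0, "x1_idx_enc_a"), (1, "x1_idx_enc_c"),
         (2, "x2_idx_enc_foo"), (3, "x3"), (4, "x4")]
coefficients = [0.1387, 0.2350, -0.0836, 0.0030, -0.1295]
p_values = [0.2640, 0.0632, 0.4045, 0.1082, 0.4546, 0.0210]  # last = intercept
t_values = [1.13, 1.89, -0.84, 1.62, -0.75, 2.33]            # last = intercept

# One row per feature: (name, coefficient, p-value, t-value).
summary_table = [
    (name, coefficients[idx], p_values[idx], t_values[idx])
    for idx, name in attrs
]
for row in summary_table:
    print(row)
```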
