How to vectorize DataFrame columns for ML algorithms?

Question

I have a DataFrame with some categorical string values (e.g. uuid|url|browser).

I would like to convert these values to doubles so that I can run an ML algorithm that accepts a matrix of doubles.

As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

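import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Index one string column and append the result as "<arg>_index".
def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  indexer.fit(df).transform(df)
}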

The problem is that I want to iterate over every column of df, call this function, and add the indexed double column next to (or in place of) the original string column, so the result would be:

Initial df:

[String: uuid|String: url| String: browser]

Final df:

[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: browser_index]

Thanks in advance.

Recommended answer

You can simply foldLeft over the Array of columns:

val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
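
To make the fold explicit: foldLeft starts from the original df and passes the result of each str call into the next one, so it behaves like str("browser", str("url", str("uuid", df))). The sketch below shows the same pattern in plain Scala, without Spark. The Table type and the addIndex helper are hypothetical stand-ins for DataFrame and str, and they track only column names.

```scala
object FoldSketch {
  // Hypothetical stand-in for a DataFrame: only the column names are tracked.
  final case class Table(columns: Vector[String])

  // Hypothetical stand-in for str: appends "<name>_index" to the table.
  def addIndex(name: String, t: Table): Table =
    Table(t.columns :+ s"${name}_index")

  // Fold over the ORIGINAL column names; the accumulator is the growing table.
  def indexAll(t: Table): Table =
    t.columns.foldLeft(t)((acc, name) => addIndex(name, acc))

  def main(args: Array[String]): Unit = {
    val result = indexAll(Table(Vector("uuid", "url", "browser")))
    println(result.columns.mkString("|"))
  }
}
```

Because the column names are read once before the fold starts, the newly added *_index columns are not indexed again. In this toy version the index columns end up appended after the originals rather than interleaved with them.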

Still, I would argue that this is not a good approach. Since str discards the fitted StringIndexerModel, the same mapping cannot be applied when you get new data. Because of that, I would recommend using a Pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
   cname => new StringIndexer()
     .setInputCol(cname)
     .setOutputCol(s"${cname}_index")
)

// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
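
Why keeping the fitted model matters can be illustrated without Spark. StringIndexer assigns indices by label frequency, so refitting on a new batch may give the same label a different index. The following is only a toy sketch in plain Scala: ToyIndexerModel and fit are hypothetical names, the alphabetical tie-breaking is an assumption of the sketch, and returning None for unseen labels is a choice made here rather than Spark's behavior.

```scala
object IndexerSketch {
  // Hypothetical stand-in for a fitted StringIndexerModel: a label -> index map.
  final case class ToyIndexerModel(labels: Map[String, Double]) {
    // Unseen labels yield None; this is a choice of the sketch, not Spark's behavior.
    def transform(values: Seq[String]): Seq[Option[Double]] =
      values.map(labels.get)
  }

  // "Fit": the most frequent label gets 0.0; ties are broken alphabetically
  // (the tie-breaking rule is an assumption of this sketch).
  def fit(values: Seq[String]): ToyIndexerModel = {
    val ordered = values
      .groupBy(identity)
      .toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
    ToyIndexerModel(ordered.zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap)
  }

  def main(args: Array[String]): Unit = {
    val model = fit(Seq("chrome", "firefox", "chrome", "safari")) // fit once, keep it
    println(model.transform(Seq("safari", "chrome")))             // reuse on new data
    // Refitting on the new batch alone gives "chrome" a different index.
    println(fit(Seq("safari", "safari", "chrome")).labels)
  }
}
```

A fitted PipelineModel plays the role of the kept model here: it is fitted once on the training data and then used to transform later batches with the same mapping.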

VectorAssembler can be included like this:

import org.apache.spark.ml.feature.VectorAssembler

val assembler  = new VectorAssembler()
    .setInputCols(df.columns.map(cname => s"${cname}_index"))
    .setOutputCol("features")

val stages = transformers :+ assembler

You could also use RFormula, which is less customizable but much more concise:

import org.apache.spark.ml.feature.RFormula

val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)
