How can I declare a Column as a categorical feature in a DataFrame for use in ml?

Question

How can I declare that a given column in my DataFrame contains categorical information?

I have a Spark SQL DataFrame that I loaded from a database. Many of its columns hold categorical information, but they are encoded as Longs (for privacy).

I want to be able to tell spark-ml that even though this column is numerical, the information is actually categorical. The category indexes may have a few holes, and that is acceptable (e.g. a column may have the values [1, 0, 0, 4]).

I am aware that StringIndexer exists, but I would prefer to avoid the hassle of encoding and decoding, especially because I have many columns that behave this way.

I am looking for something like the following:

train = load_from_database()
categorical_cols = ["CategoricalColOfLongs1",
                    "CategoricalColOfLongs2"]
numeric_cols = ["NumericColOfLongs1"]

## This is what I am looking for
## this step detects the min and max value of both columns
## and adds metadata to indicate this as a categorical column
## with (1 + max - min) categories
categorizer = ColumnCategorizer(columns = categorical_cols,
                                autoDetectMinMax = True)
##

vectorizer = VectorAssembler(inputCols = categorical_cols + 
                                         numeric_cols,
                             outputCol = "features")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [categorizer, vectorizer, classifier])
model = pipeline.fit(train)

Recommended answer

I would prefer to avoid the hassle of encoding and decoding,

You cannot really avoid this completely. The metadata required for a categorical variable is actually a mapping between value and index. Still, there is no need to do it manually or to create a custom transformer. Let's assume you have a data frame like this:

import numpy as np
import pandas as pd

df = sqlContext.createDataFrame(pd.DataFrame({
    "x1": np.random.random(1000),
    "x2": np.random.choice(3, 1000),
    "x4": np.random.choice(5, 1000)
}))

All you need is an assembler and an indexer:

from pyspark.ml.feature import VectorAssembler, VectorIndexer
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=df.columns, outputCol="features_raw"),
    VectorIndexer(
        inputCol="features_raw", outputCol="features", maxCategories=10)])

transformed = pipeline.fit(df).transform(df)
transformed.schema.fields[-1].metadata

## {'ml_attr': {'attrs': {'nominal': [{'idx': 1,
##      'name': 'x2',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0']},
##     {'idx': 2,
##      'name': 'x4',
##      'ord': False,
##      'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']}],
##    'numeric': [{'idx': 0, 'name': 'x1'}]},
##   'num_attrs': 3}}
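
To make the role of maxCategories and the `vals` lists more concrete, here is a minimal plain-Python sketch of the idea: a column with at most `max_categories` distinct values is treated as categorical and gets a value-to-index map, and anything else stays continuous. The function `index_categories` and the sample columns are hypothetical names made up for this illustration. This is not Spark's implementation, and VectorIndexer may order categories differently.

```python
def index_categories(columns, max_categories):
    """Split columns into categorical (value -> index map) and continuous ones.

    columns: dict mapping a column name to a list of numeric values.
    Hypothetical helper for illustration only, not part of Spark.
    """
    category_maps = {}
    continuous = []
    for name, values in columns.items():
        distinct = sorted(set(values))
        if len(distinct) <= max_categories:
            # Gaps in the raw values disappear: indices always run 0..k-1.
            category_maps[name] = {value: i for i, value in enumerate(distinct)}
        else:
            continuous.append(name)
    return category_maps, continuous


columns = {
    "holes": [1, 0, 0, 4],            # categorical codes with gaps
    "measure": [0.1, 0.7, 1.3, 2.9],  # more distinct values than the limit
}
category_maps, continuous = index_categories(columns, max_categories=3)

# Encode the raw codes as contiguous category indices.
encoded = [category_maps["holes"][v] for v in columns["holes"]]
print(category_maps, continuous, encoded)
```

This is why the encoding step cannot be skipped entirely: raw codes with holes such as [1, 0, 0, 4] still have to be mapped to contiguous indices.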

The metadata printed above also shows what information you need to provide to mark a given element of the vector as a categorical variable:

{
    'idx': 2,  # Index (position in vector)
    'name': 'x4',  # name
    'ord': False,  # is ordinal?
    # Mapping between value and label
    'vals': ['0.0', '1.0', '2.0', '3.0', '4.0']  
}
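
As a rough sketch of how such a dictionary could be assembled by hand, the following plain-Python helper builds the same structure as the metadata printed earlier. The helper name `build_ml_attr` and its input format are assumptions made for this illustration, not a Spark API. It only constructs the dictionary and does not check that a given Spark version accepts it.

```python
def build_ml_attr(attrs):
    """Build an 'ml_attr' style metadata dict.

    attrs: list of (name, categories) pairs in vector order, where
    categories is a list of category values for a nominal attribute
    or None for a numeric one. Hypothetical helper, not part of Spark.
    """
    nominal, numeric = [], []
    for idx, (name, categories) in enumerate(attrs):
        if categories is None:
            numeric.append({"idx": idx, "name": name})
        else:
            nominal.append({
                "idx": idx,                             # position in the vector
                "name": name,
                "ord": False,                           # not ordinal
                "vals": [str(c) for c in categories],   # labels by index
            })
    grouped = {}
    if nominal:
        grouped["nominal"] = nominal
    if numeric:
        grouped["numeric"] = numeric
    return {"ml_attr": {"attrs": grouped, "num_attrs": len(attrs)}}


example_meta = build_ml_attr([
    ("x1", None),
    ("x2", [0.0, 1.0, 2.0]),
    ("x4", [0.0, 1.0, 2.0, 3.0, 4.0]),
])
print(example_meta)
```

For the three columns used in this answer, the result matches the metadata that the assembler and indexer pipeline produced.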

So if you want to build this from scratch, all you have to do is provide the correct schema:

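from pyspark.sql.types import StructField, StructType
# With pyspark.ml pipelines on Spark 2.0+, import VectorUDT from pyspark.ml.linalg instead
from pyspark.mllib.linalg import VectorUDT

# Let's assume we have only a vector
raw = transformed.select("features_raw")

# Dictionary equivalent to transformed.schema.fields[-1].metadata shown above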
meta = ... 
schema = StructType([StructField("features", VectorUDT(), metadata=meta)])

sqlContext.createDataFrame(raw.rdd, schema)

But this is quite inefficient because of the required serialization and deserialization.

Since Spark 2.2 you can also use the metadata argument:

from pyspark.sql.functions import col

df.withColumn("features", col("features").alias("features", metadata=meta))

See also: Attach metadata to vector column in Spark
