How to update pyspark dataframe metadata on Spark 2.1?

Question

I'm facing an issue with the OneHotEncoder of SparkML, since it reads dataframe metadata in order to determine the value range it should assign for the sparse vector object it's creating.

More specifically, I'm encoding an "hour" field using a training set containing all individual values between 0 and 23.

Now I'm scoring a single-row data frame using the "transform" method of the Pipeline.

Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder:

(24,[5],[1.0]) vs. (11,[10],[1.0])
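
A minimal sketch that reproduces this kind of mismatch (the column name and dropLast=False are assumptions for illustration; the point is that the "hour" column carries no ml_attr metadata):

from pyspark.ml.feature import OneHotEncoder

# training frame covering every hour 0-23, and a single-row frame to score
train = spark.createDataFrame([(float(h),) for h in range(24)], ["hour"])
single = spark.createDataFrame([(10.0,)], ["hour"])

encoder = OneHotEncoder(inputCol="hour", outputCol="hour_vec", dropLast=False)

encoder.transform(train).where("hour = 10").select("hour_vec").first()[0]
# SparseVector(24, {10: 1.0})    <- size taken from the values seen in the data

encoder.transform(single).select("hour_vec").first()[0]
# SparseVector(11, {10: 1.0})    <- with one row, the inferred size shrinks to max value + 1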

I've documented this here, but it was identified as a duplicate. So in this thread there is a solution posted to update the dataframe's metadata to reflect the real range of the "hour" field:

from pyspark.sql.functions import col

meta = {"ml_attr": {
    "vals": [str(x) for x in range(6)],   # Provide a set of levels
    "type": "nominal", 
    "name": "class"}}

# `loaded` is the fitted pipeline model being applied (defined in the linked answer, not shown here)
loaded.transform(
    df.withColumn("class", col("class").alias("class", metadata=meta)))

Unfortunately, I get this error:

TypeError: alias() got an unexpected keyword argument 'metadata'

Answer

In PySpark 2.1, the alias method has no metadata argument (docs); this only became available in Spark 2.2. Nevertheless, it is still possible to modify column metadata in PySpark < 2.2, thanks to the incredible Spark Gotchas, maintained by @eliasah and @zero323:

import json

from pyspark import SparkContext
from pyspark.sql import Column
from pyspark.sql.functions import col

spark.version
# u'2.1.1'

df = sc.parallelize((
        (0, "x", 2.0),
        (1, "y", 3.0),
        (2, "x", -1.0)
        )).toDF(["label", "x1", "x2"])

df.show()
# +-----+---+----+ 
# |label| x1|  x2|
# +-----+---+----+
# |    0|  x| 2.0|
# |    1|  y| 3.0|
# |    2|  x|-1.0|
# +-----+---+----+

Suppose we want to enforce that our label data can take values between 0 and 5, even though in our dataframe the values only range from 0 to 2; here is how we should modify the column metadata:

# attach ML attribute metadata to a Column by calling the JVM-side `as(alias, metadata)`
# overload, which PySpark 2.1 does not expose through Column.alias
def withMeta(self, alias, meta):
    sc = SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))

Column.withMeta = withMeta

# new metadata:
meta = {"ml_attr": {"name": "label_with_meta",
                    "type": "nominal",
                    "vals": [str(x) for x in range(6)]}}

df_with_meta = df.withColumn("label_with_meta", col("label").withMeta("", meta))
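
As a quick sanity check (a sketch, not part of the original answer; the OneHotEncoder settings below are assumptions), the metadata can be read back from the schema, and an encoder should now size its vectors from the six declared levels rather than from the three observed values:

# the metadata travels with the new column's schema field
df_with_meta.select("label_with_meta").schema.fields[0].metadata
# {'ml_attr': {'name': 'label_with_meta', 'type': 'nominal', 'vals': ['0', '1', '2', '3', '4', '5']}}

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCol="label_with_meta", outputCol="label_vec", dropLast=False)
encoder.transform(df_with_meta).select("label_vec").collect()
# [Row(label_vec=SparseVector(6, {0: 1.0})),
#  Row(label_vec=SparseVector(6, {1: 1.0})),
#  Row(label_vec=SparseVector(6, {2: 1.0}))]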

Kudos also to this answer by zero323!
