How to change column metadata in pyspark?


Question


How can I update column metadata in PySpark? I have metadata values corresponding to the nominal encoding of categorical (string) features, and I would like to decode them back in an automated way. Writing metadata through the PySpark API is not directly possible unless you recreate the schema. Is it possible to edit metadata in PySpark on the fly, without converting the dataset to an RDD and back, given a complete schema description (as described here)?

Example listing:

from pyspark.ml.feature import IndexToString, VectorSlicer
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

# Create DF
df.show()

# +---+-------------+
# | id|     features|
# +---+-------------+
# |  0|[1.0,1.0,4.0]|
# |  1|[2.0,2.0,4.0]|
# +---+-------------+
# - That one has all the necessary metadata about what is encoded in feature column

# Slice one feature out
df = VectorSlicer(inputCol='features', outputCol='categoryIndex', indices=[1]).transform(df)
df = df.drop('features')
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|        [1.0]|
# |  1|        [2.0]|
# +---+-------------+
# categoryIndex now carries metadata about singular array with encoding

# Get rid of the singular array
udf = UserDefinedFunction(lambda x: float(x[0]), returnType=DoubleType())
df2 = df.select(*[udf(column).alias(column) if column == 'categoryIndex' else column for column in df.columns])
# +---+-------------+
# | id|categoryIndex|
# +---+-------------+
# |  0|          1.0|
# |  1|          2.0|
# +---+-------------+
# - Metadata is lost for that one


# Write metadata
extract = {...}  # placeholder for a function that builds the desired metadata dict
df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)
# metadata is readable from df2.schema.fields[1].metadata but has no effect:
# saving the DataFrame to parquet and reading it back discards the change
# Decode categorical
df2 = IndexToString(inputCol="categoryIndex", outputCol="category").transform(df2)
# ERROR: this was supposed to decode the categorical values, but the metadata it relies on is gone

That question provides insight into how to work with VectorAssembler and VectorIndexer, and how to add metadata by constructing a complete schema using StructType, but it does not answer my question.

Answer


In both cases losing metadata is expected:


  • When you call a Python udf there is no relationship between the input Column (and its metadata) and the output Column. A UserDefinedFunction (in both Python and Scala) is a black box for the Spark engine (a short illustration follows the code examples below).
  • Assigning data directly to the Python schema object:

df2.schema.fields[1].metadata = extract(df.schema.fields[1].metadata)


is not a valid approach at all. A Spark DataFrame is a thin wrapper around a JVM object. Any changes made in the Python wrapper are completely opaque to the JVM backend and won't be propagated at all:

import json 

df = spark.createDataFrame([(1, "foo")], ("k", "v"))
df.schema[-1].metadata = {"foo": "bar"}

json.loads(df._jdf.schema().json())

## {'fields': [{'metadata': {}, 'name': 'k', 'nullable': True, 'type': 'long'},
##   {'metadata': {}, 'name': 'v', 'nullable': True, 'type': 'string'}],
## 'type': 'struct'}


or even preserved in Python:

df.select("*").schema[-1].metadata
## {}
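
As a quick illustration of the first point, here is a minimal sketch (assuming an active SparkSession named spark): metadata declared in the schema survives a plain projection, but not a trip through a udf:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
    StructField("k", LongType(), True),
    StructField("v", StringType(), True, metadata={"foo": "bar"}),
])
df_m = spark.createDataFrame([(1, "foo")], schema)

df_m.select("v").schema["v"].metadata
## {'foo': 'bar'}

# An identity udf returns the same values, but the metadata is gone
identity = udf(lambda x: x, StringType())
df_m.select(identity("v").alias("v")).schema["v"].metadata
## {}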


With Spark < 2.2 you can use a small wrapper (taken from Spark Gotchas, maintained by me and @eliasah):

from pyspark import SparkContext
from pyspark.sql import Column
from pyspark.sql.functions import col

def withMeta(self, alias, meta):
    sc = SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))

df.withColumn("foo", withMeta(col("foo"), "", {...}))

With Spark >= 2.2 you can use Column.alias:

df.withColumn("foo", col("foo").alias("", metadata={...}))
