Is there a way to add extra metadata for Spark dataframes?

Question

Is it possible to add extra meta data to DataFrames?

I have Spark DataFrames for which I need to keep extra information. Example: A DataFrame, for which I want to "remember" the highest used index in an Integer id column.

I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
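For concreteness, a minimal sketch of that workaround (the names are illustrative, not taken from the question, and it assumes the spark-shell implicits for toDF, as in the answer below):

// Illustrative only: track the highest used id in a separate one-row DataFrame.
import org.apache.spark.sql.functions.max
val data = sc.parallelize(Seq(1, 5, 3)).toDF("id")
val highestId = data.agg(max("id")).first().getInt(0)
val dataMeta = sc.parallelize(Seq(highestId)).toDF("highestUsedId") // has to be kept in sync with `data` by hand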

Is there a better solution to store such extra information on DataFrames?

Answer

To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:

import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")

And some way to get the max or whatever you want to memoize on the DataFrame:

val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)

sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:

val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()

DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):

val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)

dfWithMax now has (a column with) the metadata you want!

dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}

Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):

dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 2094414111
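Metadata.contains() lets you guard the lookup yourself; a small helper along these lines (the helper is ours, not part of the Spark API) gives you an Option instead of an exception:

// Hypothetical helper: Option-wrap the unsafe getLong using Metadata.contains.
def metadataLong(df: org.apache.spark.sql.DataFrame, column: String, key: String): Option[Long] = {
  val meta = df.schema(column).metadata
  if (meta.contains(key)) Some(meta.getLong(key)) else None
}

metadataLong(dfWithMax, "randInt_withMax", "columnMax") // Some(2094414111)
metadataLong(dfWithMax, "randInt", "columnMax")         // None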

Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
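For what it's worth, that wrapper route is essentially along these lines (purely a sketch, not a Spark API; the metadata only travels as far as your own code passes the wrapper around):

// Sketch of DataFrame-level metadata via a plain wrapper. Any operation that
// returns a bare DataFrame drops the map, so it has to be re-wrapped by hand.
case class DataFrameWithMeta(df: org.apache.spark.sql.DataFrame, meta: Map[String, Any])

val wrapped = DataFrameWithMeta(df, Map("columnMax" -> randIntMax))
wrapped.meta("columnMax") // 2094414111 in the run above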
