Is there a way to add extra metadata for Spark dataframes?
Question
Is it possible to add extra metadata to DataFrames?
I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest used index in an Integer id column.
I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
Is there a better solution to store such extra information on DataFrames?
Answer
To expand on and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:
import org.apache.spark.sql
// Assumes a Spark shell session, where sc and the toDF implicits are already in scope
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")
And some way to get the max, or whatever else you want to memoize on the DataFrame:
val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)
sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures, so we have to use a Long:
val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()
DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does ourselves: use Column.as(alias, metadata):
val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)
dfWithMax now has (a column with) the metadata you want!
dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}
Or programmatically and (sort of) type-safely; note that Metadata.getLong() and the other getters do not return Option and may throw a "key not found" exception:
dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
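Since those getters can throw, one defensive pattern is to wrap the lookup in scala.util.Try. The sketch below is deliberately Spark-free: safeLookup is a hypothetical helper (not a Spark API), demonstrated against a plain Map standing in for the metadata object.

```scala
import scala.util.Try

// Hypothetical helper (not part of Spark): turn a throwing lookup into an Option.
def safeLookup[A](lookup: => A): Option[A] = Try(lookup).toOption

// A plain Map stands in for sql.types.Metadata here.
val meta = Map("columnMax" -> 2094414111L)
val present = safeLookup(meta("columnMax")) // Some(2094414111)
val absent  = safeLookup(meta("missing"))   // None
```

In a real session the same call-by-name wrapper works around the throwing getter, e.g. safeLookup(dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")); alternatively, Metadata.contains(key) can be checked before calling getLong.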
Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
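As a minimal sketch of what that wrapper route could look like (WithMeta is a hypothetical case class, not a Spark type): pair the value, which in practice would be the DataFrame itself, with its own metadata map.

```scala
// Hypothetical wrapper (not a Spark API): carries metadata alongside any value.
final case class WithMeta[A](data: A, meta: Map[String, Long] = Map.empty) {
  def addMeta(key: String, value: Long): WithMeta[A] =
    copy(meta = meta + (key -> value))
}

// In real code A would be DataFrame; a Seq stands in here.
val wrapped = WithMeta(Seq(1, 2, 3)).addMeta("columnMax", 3L)
// wrapped.meta("columnMax") == 3L
```

The obvious caveat: every transformation on the underlying DataFrame produces a new DataFrame that must be re-wrapped by hand, which is exactly the tedium the question complains about.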