Cast values of a Spark dataframe using a defined StructType

Question

Is there a way to cast all the values of a dataframe using a StructType?

Let me explain my question with an example:

Let's say we obtained a dataframe by reading from a file. (The code below generates an equivalent dataframe; in my real project the dataframe comes from reading a file.)

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import spark.implicits._
    val rows1 = Seq(
      Row("1", Row("a", "b"), "8.00", Row("1","2")),
      Row("2", Row("c", "d"), "9.00", Row("3","4"))
    )

    val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

    val schema1 = StructType(
      Seq(
        StructField("id", StringType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", StringType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", StringType, true),
            StructField("v", StringType, true)
          )
        ), true)
      )
    )

    val df1 = spark.createDataFrame(rows1Rdd, schema1)

    println("Schema with nested struct")
    df1.printSchema()

    root
    |-- id: string (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: string (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: string (nullable = true)
    |    |-- v: string (nullable = true)

Now let's say my client gave me the schema of the data he wants. It has the same structure as the schema of the dataframe that was read, but with different data types (a mix of StringType, IntegerType, and so on):

    val wantedSchema = StructType(
      Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", IntegerType, true),
            StructField("v", IntegerType, true)
          )
        ), true)
      )
    )

What is the best way to cast the dataframe's values using the provided StructType?

It would be great if there were a method that could be applied to a dataframe and that applied the new StructType by casting all the values itself.

PS: This is a small dataframe used as an example; in my project the dataframe contains many more rows. If it were a small dataframe with only a few columns, I could have done the casts easily, but in my case I am looking for a smart solution that casts all the values by applying a StructType, without having to cast each column/value manually in the code.

I would be grateful for any help you can provide. Thanks a lot!

Recommended answer

After a lot of research, here is a generic solution for casting a dataframe according to a schema:

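    // Build one "CAST(<column> AS <type>) AS <column>" expression per top-level field
    // of wantedSchema; field.dataType.sql renders the target type as a SQL type string.
    val castedDf = df1.selectExpr(wantedSchema.map(
      field => s"CAST(${field.name} AS ${field.dataType.sql}) AS ${field.name}"
    ): _*)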

Here is the schema of the resulting dataframe:

    castedDf.printSchema

    root
    |-- id: integer (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: double (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: integer (nullable = true)
    |    |-- v: integer (nullable = true)
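
For comparison, the same idea can probably be expressed with the Column API instead of SQL strings, since `Column.cast` accepts a `DataType` directly. The following is only a sketch under a few assumptions: `CastToSchemaExample` and `castToSchema` are hypothetical names (they are not part of Spark), the session is a local one created just for the demonstration, and the schemas are a trimmed two-column version of the ones in the question.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object CastToSchemaExample {

  // Hypothetical helper (not a Spark API): casts every top-level column of `df`
  // to the type declared for the same-named field in `target`.
  def castToSchema(df: DataFrame, target: StructType): DataFrame =
    df.select(target.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-to-schema")
      .getOrCreate()

    // Trimmed version of the question's schema1: everything is read as strings.
    val source = StructType(Seq(
      StructField("id", StringType, true),
      StructField("s2", StructType(Seq(
        StructField("u", StringType, true),
        StructField("v", StringType, true))), true)))

    // Trimmed version of the question's wantedSchema.
    val target = StructType(Seq(
      StructField("id", IntegerType, true),
      StructField("s2", StructType(Seq(
        StructField("u", IntegerType, true),
        StructField("v", IntegerType, true))), true)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row("1", Row("1", "2")))), source)

    // The nested struct is cast as a whole; no per-field code is needed.
    castToSchema(df, target).printSchema()

    spark.stop()
  }
}
```

Both variants match columns by name at the top level only. As far as I know, the nested struct cast relies on the source and target structs having the same number of fields in the same order, so it is worth checking the printed schema against the wanted one.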

I hope this helps someone; I spent 5 days looking for this simple, generic solution.
