Cast values of a Spark dataframe using a defined StructType


Question

Is there a way to cast all the values of a dataframe using a StructType?

Let me explain my question with an example:

Let's say we obtained a dataframe by reading from a file. (The code below generates an equivalent dataframe; in my real-world project, the dataframe comes from reading a file.)

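    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._

    // In spark-shell, `spark` already exists; outside the shell, create the session first.
    val spark = SparkSession.builder().master("local[*]").appName("cast-example").getOrCreate()
    import spark.implicits._
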
    val rows1 = Seq(
      Row("1", Row("a", "b"), "8.00", Row("1","2")),
      Row("2", Row("c", "d"), "9.00", Row("3","4"))
    )

    val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

    val schema1 = StructType(
      Seq(
        StructField("id", StringType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", StringType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", StringType, true),
            StructField("v", StringType, true)
          )
        ), true)
      )
    )

    val df1 = spark.createDataFrame(rows1Rdd, schema1)

    println("Schema with nested struct")
    df1.printSchema()

    root
    |-- id: string (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: string (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: string (nullable = true)
    |    |-- v: string (nullable = true)

Now let's say my client provided the schema of the data he wants. It has the same structure as the schema of the dataframe that was read, but with different data types (a mix of StringType, IntegerType, and so on):

    val wantedSchema = StructType(
      Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", IntegerType, true),
            StructField("v", IntegerType, true)
          )
        ), true)
      )
    )

What is the best way to cast the dataframe's values using the provided StructType?

Ideally, there would be a method that can be applied to a dataframe and that applies the new StructType by casting all the values itself.

PS: This small dataframe is only an example; in my project the dataframe contains many more rows. If it were a small dataframe with only a few columns, I could do the casts easily, but in my case I am looking for a smart solution that casts all the values by applying a StructType, without having to cast each column/value manually in the code.

I would be grateful for any help you can provide. Thanks a lot!

Recommended answer

After a lot of research, here is a generic solution for casting a dataframe to follow a schema:

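    // Build one "CAST(<column> AS <type>) <column>" expression per top-level field of the wanted schema.
    val castedDf = df1.selectExpr(wantedSchema.map(
      field => s"CAST(${field.name} AS ${field.dataType.sql}) ${field.name}"
    ): _*)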
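
To see what `selectExpr` actually receives, it can help to print the generated expressions before running them. The following is a minimal sketch that only needs the Spark SQL types on the classpath (no session); `wantedSchema` is rebuilt here so the snippet is self-contained, and the name `castExprs` is made up for this illustration. The exact text produced by `dataType.sql` for nested structs may differ between Spark versions, so no output is shown.

```scala
import org.apache.spark.sql.types._

// Same schema as in the question, repeated so this snippet runs on its own.
val wantedSchema = StructType(Seq(
  StructField("id", IntegerType, true),
  StructField("s1", StructType(Seq(
    StructField("x", StringType, true),
    StructField("y", StringType, true)
  )), true),
  StructField("d", DoubleType, true),
  StructField("s2", StructType(Seq(
    StructField("u", IntegerType, true),
    StructField("v", IntegerType, true)
  )), true)
))

// `castExprs` is a hypothetical name: one SQL CAST expression per top-level field.
val castExprs: Seq[String] = wantedSchema.map { field =>
  s"CAST(${field.name} AS ${field.dataType.sql}) ${field.name}"
}

// Inspect the strings that would be passed to df1.selectExpr(castExprs: _*).
castExprs.foreach(println)
```

Note that the column names are interpolated unquoted, as in the answer; names containing spaces or other special characters would probably need backtick quoting.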

Here is the schema of the cast dataframe:

    castedDf.printSchema()

    root
    |-- id: integer (nullable = true)
    |-- s1: struct (nullable = true)
    |    |-- x: string (nullable = true)
    |    |-- y: string (nullable = true)
    |-- d: double (nullable = true)
    |-- s2: struct (nullable = true)
    |    |-- u: integer (nullable = true)
    |    |-- v: integer (nullable = true)
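
As an alternative to building SQL strings, the same idea can probably be expressed with the Column API, which accepts a `DataType` directly through `Column.cast`. This is only a sketch, not the answer's method: the helper name `castToSchema`, the local session settings, and the trimmed two-column data are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cast-to-schema").getOrCreate()

// Hypothetical helper: cast each top-level column to the type declared in `target`.
// Nested structs are cast as a whole, so source and target structs need the same number of fields.
// Note: col() treats dots in a name as nested-field access.
def castToSchema(df: DataFrame, target: StructType): DataFrame =
  df.select(target.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

// Trimmed-down version of the question's data: one flat column and one struct column.
val sourceSchema = StructType(Seq(
  StructField("id", StringType, true),
  StructField("s2", StructType(Seq(
    StructField("u", StringType, true),
    StructField("v", StringType, true)
  )), true)
))
val wanted = StructType(Seq(
  StructField("id", IntegerType, true),
  StructField("s2", StructType(Seq(
    StructField("u", IntegerType, true),
    StructField("v", IntegerType, true)
  )), true)
))

val rows = Seq(Row("1", Row("1", "2")), Row("2", Row("3", "4")))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), sourceSchema)

val casted = castToSchema(df, wanted)
casted.printSchema()
```

Either way, values that cannot be converted need attention: depending on the Spark version and the `spark.sql.ansi.enabled` setting, a failed cast yields either a null or a runtime error, so it is worth checking the result on real data.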

I hope this helps someone; I spent 5 days looking for this simple, generic solution.
