PySpark: How to Update Nested Columns?


Question

StackOverflow has a few answers on how to update nested columns in a dataframe. However, some of them look a bit complex.

While searching, I found this documentation from Databricks that handles the same scenario: https://docs.databricks.com/user-guide/faq/update-nested-column.html

// Rebuild the nested struct: multiply items.books.fees by 1.01
// and carry the remaining fields through unchanged.
val updated = df.selectExpr("""
    named_struct(
        'metadata', metadata,
        'items', named_struct(
          'books', named_struct('fees', items.books.fees * 1.01),
          'paper', items.paper
        )
    ) as named_struct
""").select($"named_struct.metadata", $"named_struct.items")

That looks pretty clean as well. Unfortunately, I don't know Scala. How would I translate this to Python?

Answer

This might help you get started; I converted your Databricks link to Python using a one-row example for you to explore.

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Schema matching the Databricks example:
# metadata.{eventid, hostname, timestamp} and items.{books.fees, paper.pages}
schema = StructType()\
.add("metadata", StructType()\
     .add("eventid", IntegerType(), True)\
     .add("hostname", StringType(), True)\
     .add("timestamp", StringType(), True))\
.add("items", StructType()\
     .add("books", StructType()\
         .add("fees", DoubleType(), True))\
     .add("paper", StructType()\
         .add("pages", IntegerType(), True)))

# One row of dummy data matching the schema above
nested_row = [
    {
        "metadata": {
            "eventid": 9,
            "hostname": "999.999.999",
            "timestamp": "9999-99-99 99:99:99"
        },
        "items": {
            "books": {"fees": 99.99},
            "paper": {"pages": 9999}
        }
    }
]

df = spark.createDataFrame(nested_row, schema)

df.printSchema()
df.show(truncate=False)  # original values, before the update

# Rebuild the nested struct, bumping items.books.fees by 1%,
# then flatten the wrapper struct back into top-level columns
df.selectExpr("""
    named_struct(
        'metadata', metadata,
        'items', named_struct(
          'books', named_struct('fees', items.books.fees * 1.01),
          'paper', items.paper
        )
    ) as named_struct
""").select(col("named_struct.metadata"), col("named_struct.items"))\
.show(truncate=False)

root
 |-- metadata: struct (nullable = true)
 |    |-- eventid: integer (nullable = true)
 |    |-- hostname: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- items: struct (nullable = true)
 |    |-- books: struct (nullable = true)
 |    |    |-- fees: double (nullable = true)
 |    |-- paper: struct (nullable = true)
 |    |    |-- pages: integer (nullable = true)

+-------------------------------------+-----------------+
|metadata                             |items            |
+-------------------------------------+-----------------+
|[9, 999.999.999, 9999-99-99 99:99:99]|[[99.99], [9999]]|
+-------------------------------------+-----------------+

+-------------------------------------+------------------------------+
|metadata                             |items                         |
+-------------------------------------+------------------------------+
|[9, 999.999.999, 9999-99-99 99:99:99]|[[100.98989999999999], [9999]]|
+-------------------------------------+------------------------------+
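As a side note, the same update can also be written with the DataFrame API instead of a SQL expression string. Below is a minimal sketch of that translation (the name updated, and the reuse of df from the example above, are my own); it rebuilds items with struct() rather than named_struct():

from pyspark.sql.functions import col, struct

# Reassemble items: scale books.fees by 1.01 and copy paper through unchanged
updated = df.select(
    col("metadata"),
    struct(
        struct((col("items.books.fees") * 1.01).alias("fees")).alias("books"),
        col("items.paper").alias("paper"),
    ).alias("items"),
)
updated.show(truncate=False)

On Spark 3.1 or later, Column.withField can replace a single nested field without spelling out the whole struct, e.g. df.withColumn("items", col("items").withField("books.fees", col("items.books.fees") * 1.01)).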
