Updating a dataframe column in Spark


Problem Description

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing a value in row x, column y of a dataframe?

In pandas, this would be df.ix[x,y] = new_value.

Consolidating what was said below: you can't modify the existing dataframe, as it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

from pyspark.sql import functions as F

# replace_val (the value to match) and new_value (its replacement) are placeholders
update_func = (F.when(F.col('update_col') == replace_val, new_value)
                .otherwise(F.col('update_col')))
df = df.withColumn('new_column_name', update_func)
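
For instance, a minimal runnable sketch (the SparkSession, the toy rows, and the 'old'/'new' values are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data: one string column to update
df = spark.createDataFrame([('old',), ('keep',)], ['update_col'])

# rows matching the condition get the replacement; all others pass through unchanged
update_func = (F.when(F.col('update_col') == 'old', 'new')
                .otherwise(F.col('update_col')))
df.withColumn('new_column_name', update_func).show()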

If you want to perform some operation on a column and create a new column that is added to the dataframe:

import pyspark.sql.functions as F
import pyspark.sql.types as T

def my_func(col):
    # illustrative transformation (an assumption): upper-case the value, null-safe
    return col.upper() if col is not None else None

# if we assume that my_func returns a string
my_udf = F.UserDefinedFunction(my_func, T.StringType())

df = df.withColumn('new_column_name', my_udf('update_col'))
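
As a side note, newer PySpark versions expose the same mechanism through F.udf, which can also be used as a decorator; a sketch under that assumption:

import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.udf(returnType=T.StringType())
def my_func(col):
    # same illustrative, null-safe transformation as above
    return col.upper() if col is not None else None

df = df.withColumn('new_column_name', my_func('update_col'))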

If you want the new column to have the same name as the old column, you could add the additional step:

df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')
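
Put together, the whole replace-a-column-in-place pattern reads as one chain (using the illustrative my_udf from above):

df = (df.withColumn('new_column_name', my_udf('update_col'))
        .drop('update_col')
        .withColumnRenamed('new_column_name', 'update_col'))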

Recommended Answer

While you cannot modify a column as such, you can operate on a column and return a new DataFrame reflecting that change. To do that, first create a UserDefinedFunction implementing the operation to apply, then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column is of type StringType as well), but all values in column target_column will be new_value.
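
As a quick check, a hedged sketch with hypothetical toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
old_df = spark.createDataFrame([('a', 1), ('b', 2)], ['target_column', 'other_col'])

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                         for column in old_df.columns])

new_df.printSchema()  # same schema as old_df
new_df.show()         # target_column is 'new_value' in every row; other_col is untouched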

