Updating a dataframe column in spark
Question
Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns.
How would I go about changing a value in row x, column y of a dataframe?
In pandas this would be df.ix[x, y] = new_value.
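(For reference, the .ix indexer mentioned above has since been removed from pandas; a minimal sketch of the same assignment with the current .loc indexer — the frame and column name here are made up for illustration:)

```python
import pandas as pd

# Illustrative frame; the column name "y" is hypothetical.
df = pd.DataFrame({"y": ["old_value", "old_value"]})

# .loc is the modern replacement for .ix for label-based assignment.
df.loc[0, "y"] = "new_value"
```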
Answer
While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply, and then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
# Wrap the operation in a UDF; here it simply replaces every value.
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
# Apply the UDF to the target column only, passing the rest through unchanged.
new_df = old_df.select(
    *[udf(column).alias(name) if column == name else column
      for column in old_df.columns]
)
new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but all values in column target_column will be new_value.