How to overwrite entire existing column in Spark dataframe with new column?


Problem description

I want to overwrite a Spark column with a new column that is a binary flag.

I tried directly overwriting the column id2, but why does it not work like an in-place operation in Pandas?

How can I do this without using withColumn() to create a new column and drop() to drop the old one?

I know that a Spark dataframe is immutable. Is that the reason, or is there a different way to overwrite a column without using withColumn() and drop()?

    df2 = spark.createDataFrame(
        [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

    df2.select(df2.id2 > 0).show()

+---------+
|(id2 > 0)|
+---------+
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
+---------+
    # Attempting to overwrite df2.id2
    df2.id2 = df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)', 'id2')
    df2.show()
# Overwriting unsuccessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1| NaN|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4| NaN|
|      1|         5|10.0|
|      1|         6| NaN|
|      1|         6| NaN|
+-------+----------+----+

Answer

You can use

d1.withColumnRenamed("colName", "newColName")
d1.withColumn("newColName", d1["colName"])

withColumnRenamed renames the existing column to the new name.

withColumn creates a new column with the given name. If a column with that name already exists, the new column replaces it and the old one is dropped.
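As a minimal sketch of that behavior, reusing the df2 defined in the question (flagged is a name introduced here for illustration):

# 'id2' already exists, so withColumn replaces it in the returned dataframe;
# session and timestamp1 are carried over unchanged.
flagged = df2.withColumn('id2', df2.id2 > 0)
flagged.show()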

In your case, the changes are not applied to the original dataframe df2. The operation changes the column name and returns a new dataframe, which should be assigned to a new variable for further use.
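A quick demonstration of that immutability, again reusing df2 from the question (renamed is a hypothetical variable name):

# withColumnRenamed returns a new dataframe; df2 itself is untouched.
renamed = df2.withColumnRenamed('id2', 'flag')
print(df2.columns)      # ['session', 'timestamp1', 'id2']
print(renamed.columns)  # ['session', 'timestamp1', 'flag']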

d3 = df2.select((df2.id2 > 0).alias("id2"))

The select statement above should work fine in your case.
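Note that d3 as written contains only the id2 flag. If you also want to keep session and timestamp1, one variant not shown in the original answer (d4 is a hypothetical name) is to select those columns alongside the aliased flag:

# Keep the other columns and overwrite id2 with the boolean flag in one select.
d4 = df2.select('session', 'timestamp1', (df2.id2 > 0).alias('id2'))
d4.show()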

Hope this helps!
