How to concatenate to a null column in pyspark dataframe
Problem description
I have the dataframe below and I want to update its rows dynamically with some values:
input_frame.show()
+----------+----------+---------+
|student_id|      name|timestamp|
+----------+----------+---------+
|        s1|  testuser|       t1|
|        s1|sampleuser|       t2|
|        s2|   test123|       t1|
|        s2| sample123|       t2|
+----------+----------+---------+
import pyspark.sql.functions as sf

input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
# concatenating with a null column yields null again
input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'), sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id| name|timestamp|test|
+----------+----------+---------+----+
| s1| testuser| t1|null|
| s1|sampleuser| t2|null|
| s2| test123| t1|null|
| s2| sample123| t2|null|
+----------+----------+---------+----+
I want to update the 'test' column with some values and then filter on partial matches against that column. But concatenating to a null column results in a null column again. How can we do this?
Recommended answer
Use concat_ws, like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]
).toDF("a", "b")

# This won't work: concat returns null if any input is null
df = df.withColumn("concat", concat(df.a, df.b))

# This won't work either: the cast leaves the null in place
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))

# Do it like this: concat_ws skips null inputs
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()
which gives:
+----+----+------+-------------+---------+
| a| b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
| 1| 2| 12| 12| 12|
| 2|null| null| null| 2|
| 3| 4| 34| 34| 34|
| 4| 5| 45| 45| 45|
|null| 6| null| null| 6|
+----+----+------+-------------+---------+
Note specifically that casting a NULL column to string does not do what you might wish: the concatenated result is still NULL whenever any input column is null.
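As a side note that goes beyond the original answer, pyspark.sql.functions.coalesce offers another way to handle the simple case: substitute a default value for each null before concatenating. A minimal sketch on the same df:

from pyspark.sql.functions import coalesce

# coalesce returns the first non-null argument, so nulls become ''
df = df.withColumn(
    "concat_coalesce",
    concat(coalesce(df.a, lit("")), coalesce(df.b, lit(""))),
)

For the row (2, null) this yields '2', matching the concat_ws behavior above.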
There's no nice way of dealing with more complicated scenarios, but note that you can use a when statement inside a concat, if you're willing to suffer its verbosity, like this:
# fall back to a '_' placeholder for null values before concatenating
df.withColumn("concat_custom", concat(
    when(df.a.isNull(), lit('_')).otherwise(df.a),
    when(df.b.isNull(), lit('_')).otherwise(df.b),
))
to get, for example:
+----+----+-------------+
| a| b|concat_custom|
+----+----+-------------+
| 1| 2| 12|
| 2|null| 2_|
| 3| 4| 34|
| 4| 5| 45|
|null| 6| _6|
+----+----+-------------+
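Applying the accepted approach back to the question's input_frame, a sketch (assuming sf is pyspark.sql.functions, as imported in the question) that fills the 'test' column and then filters on a partial match:

# concat_ws treats the null 'test' column as empty, so the literal survives
input_frame = input_frame.withColumn('test', sf.concat_ws('', sf.col('test'), sf.lit('test')))

# partial-match filter on the updated column
input_frame.filter(sf.col('test').contains('test')).show()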