Add an empty column to Spark DataFrame
Problem description
As mentioned in many (http://stackoverflow.com/questions/29483498/append-a-column-to-data-frame-in-apache-spark-1-3) other locations (http://apache-spark-user-list.1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD-td22385.html) on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?
My version goes like this:
from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))
All you need here is a literal and cast:
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))
A full example:
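from pyspark.sql import Row
row = Row("foo", "bar")  # Row factory; field names taken from the schema printed below
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()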
df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)
new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()
## root
## |-- foo: long (nullable = true)
## |-- bar: string (nullable = true)
## |-- new_column: string (nullable = true)
new_df.show()
## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+
A Scala equivalent can be found here: Create new Dataframe with empty/null field values