Add an empty column to Spark DataFrame

Problem description

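As mentioned in many (http://stackoverflow.com/questions/29483498/append-a-column-to-data-frame-in-apache-spark-1-3) other locations (http://apache-spark-user-list.1001560.n3.nabble.com/Append-column-to-Data-Frame-or-RDD-td22385.html) on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately, this functionality is important (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.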

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# UDF that ignores its input and always returns null
to_none = udf(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))

Solution

All you need here is a literal and cast:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

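from pyspark.sql import Row

row = Row("foo", "bar")  # Row factory; field names match the schema printed below
df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()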
df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))
new_df.printSchema()

## root
##  |-- foo: long (nullable = true)
##  |-- bar: string (nullable = true)
##  |-- new_column: string (nullable = true)

new_df.show()

## +---+---+----------+
## |foo|bar|new_column|
## +---+---+----------+
## |  1|  2|      null|
## |  2|  3|      null|
## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values
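
Since the question is motivated by unionAll, here is a minimal sketch of how the literal-and-cast approach could be used to line up two DataFrames before concatenating them. The helper names missing_columns and align_to are hypothetical (they are not part of PySpark), and the sketch assumes that columns sharing a name already have compatible types. It relies on DataFrame.dtypes returning (name, type string) pairs and on Column.cast accepting a type name string.

```python
def missing_columns(target_fields, existing_names):
    """Return the (name, type) pairs in target_fields whose name is not in existing_names."""
    existing = set(existing_names)
    return [(name, dtype) for name, dtype in target_fields if name not in existing]


def align_to(df, target_fields):
    """Add typed null columns so df has every column in target_fields, in that order.

    Hypothetical helper: target_fields is a list of (name, type string) pairs,
    for example the value of some_df.dtypes.
    """
    # Imported here so that missing_columns can be used without Spark installed.
    from pyspark.sql.functions import lit

    for name, dtype in missing_columns(target_fields, df.columns):
        # Same technique as in the answer: a null literal cast to the wanted type.
        df = df.withColumn(name, lit(None).cast(dtype))
    # unionAll matches columns by position, so fix the column order as well.
    return df.select([name for name, _ in target_fields])
```

Assuming two DataFrames df_a and df_b, usage might look like this:

fields = df_a.dtypes + missing_columns(df_b.dtypes, df_a.columns)
combined = align_to(df_a, fields).unionAll(align_to(df_b, fields))

In Spark 2.0 and later, union is the preferred name for unionAll.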
