How do I add a new column to a Spark data frame (PySpark)?
Question
I have a Spark data frame (using PySpark 1.5.1) and would like to add a new column.
I tried the following without success:
type(randomed_hours) # => list
#Create in Python and transform to RDD
new_col = pd.DataFrame(randomed_hours, columns=['new_col'])
spark_new_col = sqlContext.createDataFrame(new_col)
my_df_spark.withColumn("hours", spark_new_col["new_col"])
Also got an error using this:
my_df_spark.withColumn("hours", sc.parallelize(randomed_hours))
So how do I add a new column (based on a Python vector) to an existing data frame with PySpark?
Thanks! Boris
Answer
You cannot add an arbitrary column to a DataFrame in Spark. New columns can be created only by using literals:
from pyspark.sql.functions import lit
df = sqlContext.createDataFrame(
[(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()
## +---+---+-----+---+
## | x1| x2| x3| x4|
## +---+---+-----+---+
## | 1| a| 23.0| 0|
## | 3| B|-23.0| 0|
## +---+---+-----+---+
by transforming an existing column:
from pyspark.sql.functions import exp
df_with_x5 = df_with_x4.withColumn("x5", exp("x3"))
df_with_x5.show()
## +---+---+-----+---+--------------------+
## | x1| x2| x3| x4| x5|
## +---+---+-----+---+--------------------+
## | 1| a| 23.0| 0| 9.744803446248903E9|
## | 3| B|-23.0| 0|1.026187963170189...|
## +---+---+-----+---+--------------------+
by using a join:
from pyspark.sql.functions import col
lookup = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))
df_with_x6 = (df_with_x5
    .join(lookup, col("x1") == col("k"), "leftouter")
    .drop("k")
    .withColumnRenamed("v", "x6"))
df_with_x6.show()
## +---+---+-----+---+--------------------+----+
## | x1| x2| x3| x4| x5| x6|
## +---+---+-----+---+--------------------+----+
## | 1| a| 23.0| 0| 9.744803446248903E9| foo|
## | 3| B|-23.0| 0|1.026187963170189...|null|
## +---+---+-----+---+--------------------+----+
or generated with a function / UDF:
from pyspark.sql.functions import rand
df_with_x7 = df_with_x6.withColumn("x7", rand())
df_with_x7.show()
## +---+---+-----+---+--------------------+----+-------------------+
## | x1| x2| x3| x4| x5| x6| x7|
## +---+---+-----+---+--------------------+----+-------------------+
## | 1| a| 23.0| 0| 9.744803446248903E9| foo|0.41930610446846617|
## | 3| B|-23.0| 0|1.026187963170189...|null|0.37801881545497873|
## +---+---+-----+---+--------------------+----+-------------------+
If you want to add the content of an arbitrary RDD as a column, you can:
- add row numbers to the existing data frame
- call zipWithIndex on the RDD and convert it to a data frame
- join both using the index as a join key