How do I add a new column to a Spark DataFrame (PySpark)?


Question

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.

I have tried the following without success:

type(randomed_hours)  # => list

# Create in Python and transform to RDD
import pandas as pd

new_col = pd.DataFrame(randomed_hours, columns=["new_col"])
spark_new_col = sqlContext.createDataFrame(new_col)

my_df_spark.withColumn("hours", spark_new_col["new_col"])

I also got an error using this:

my_df_spark.withColumn("hours", sc.parallelize(randomed_hours))

So how do I add a new column (based on a Python vector) to an existing DataFrame with PySpark?

Thanks!
Boris

Answer

You cannot add an arbitrary column to a DataFrame in Spark. New columns can be created only by using literals:

from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()

## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1|  a| 23.0|  0|
## |  3|  B|-23.0|  0|
## +---+---+-----+---+

by transforming an existing column:

from pyspark.sql.functions import exp

df_with_x5 = df_with_x4.withColumn("x5", exp("x3"))
df_with_x5.show()

## +---+---+-----+---+--------------------+
## | x1| x2|   x3| x4|                  x5|
## +---+---+-----+---+--------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9|
## |  3|  B|-23.0|  0|1.026187963170189...|
## +---+---+-----+---+--------------------+

by using a join:

from pyspark.sql.functions import col

lookup = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))
df_with_x6 = (df_with_x5
    .join(lookup, col("x1") == col("k"), "leftouter")
    .drop("k")
    .withColumnRenamed("v", "x6"))
df_with_x6.show()

## +---+---+-----+---+--------------------+----+
## | x1| x2|   x3| x4|                  x5|  x6|
## +---+---+-----+---+--------------------+----+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|
## |  3|  B|-23.0|  0|1.026187963170189...|null|
## +---+---+-----+---+--------------------+----+

or generated with a function / UDF:

from pyspark.sql.functions import rand

df_with_x7 = df_with_x6.withColumn("x7", rand())
df_with_x7.show()

## +---+---+-----+---+--------------------+----+-------------------+
## | x1| x2|   x3| x4|                  x5|  x6|                 x7|
## +---+---+-----+---+--------------------+----+-------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|0.41930610446846617|
## |  3|  B|-23.0|  0|1.026187963170189...|null|0.37801881545497873|
## +---+---+-----+---+--------------------+----+-------------------+
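The answer mentions UDFs only in passing; `rand()` above is a built-in function, not a UDF. A minimal sketch of the UDF variant, assuming the `df_with_x7` frame from above (the `to_upper` helper and the `x8` column name are illustrative, not from the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap a plain Python function as a UDF, declaring the return type
to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())

# Derive a new column x8 from the existing x2 column
df_with_x8 = df_with_x7.withColumn("x8", to_upper("x2"))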

If you want to add the content of an arbitrary RDD as a column, you can (see the sketch after this list):

  • add row numbers to the existing DataFrame
  • call zipWithIndex on the RDD and convert it to a DataFrame
  • join both using the index as the join key
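A minimal sketch of that approach, assuming `my_df_spark`, `randomed_hours`, and `sc` from the question, and that the list has exactly one value per row in the same order:

from pyspark.sql import Row

# 1. Index the existing DataFrame's rows via its underlying RDD
df_indexed = my_df_spark.rdd.zipWithIndex().map(
    lambda row_idx: Row(idx=row_idx[1], **row_idx[0].asDict())
).toDF()

# 2. Index the Python list the same way
hours_indexed = sc.parallelize(randomed_hours).zipWithIndex().map(
    lambda val_idx: Row(idx=val_idx[1], hours=val_idx[0])
).toDF()

# 3. Join on the index and drop it
result = df_indexed.join(hours_indexed, "idx").drop("idx")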
