Create a column in a PySpark dataframe using a list whose indices are present in one column of the dataframe

Problem description

I'm new to Python and PySpark. I have a dataframe in PySpark like the following:

## +---+---+------+
## | x1| x2|   x3 |
## +---+---+------+
## |  0| a |  13.0|
## |  2| B | -33.0|
## |  1| B | -63.0|
## +---+---+------+

I have an array: arr = [10, 12, 13]

I want to create a column x4 in the dataframe such that it takes the corresponding values from the list, using the values of x1 as indices. The final dataset should look like:

## +---+---+------+-----+
## | x1| x2|   x3 |  x4 |
## +---+---+------+-----+
## |  0| a |  13.0| 10  |
## |  2| B | -33.0| 13  |
## |  1| B | -63.0| 12  |
## +---+---+------+-----+

I have tried using the following code to achieve this:

df.withColumn("x4", lit(arr[col('x1')])).show()

However, I am getting an error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Is there any way I can achieve this efficiently?

Recommended answer

The arr[col('x1')] attempt fails because col('x1') returns a Column object, not a Python integer, so it cannot be used to index a list. What you actually need is a join between the indices of your array and your original DataFrame, so one approach is to convert the array into a DataFrame, generate row_number() - 1 (which becomes the index), and then join the two DataFrames together.

from pyspark.sql import Row

# Create the original DataFrame `df`
df = spark.createDataFrame(
    [(0, "a", 13.0), (2, "B", -33.0), (1, "B", -63.0)], ("x1", "x2", "x3"))
df.createOrReplaceTempView("df")

# Row "class" with a single field, x4
row = Row("x4")

# The lookup array
arr = [10, 12, 13]

# Convert the array to an RDD, then to a single-column DataFrame
rdd = spark.sparkContext.parallelize(arr)
df2 = rdd.map(row).toDF()
df2.createOrReplaceTempView("df2")

# Create indices via row_number() - 1
# (ORDER BY x4 reproduces the list positions here because arr is ascending)
df3 = spark.sql("SELECT (row_number() OVER (ORDER BY x4)) - 1 AS indices, * FROM df2")
df3.createOrReplaceTempView("df3")
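
At this point df3 pairs each array element with its zero-based index; a quick check:

df3.show()

## +-------+---+
## |indices| x4|
## +-------+---+
## |      0| 10|
## |      1| 12|
## |      2| 13|
## +-------+---+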

Now that you have the two DataFrames, df and df3, you can run the SQL query below to join them:

SELECT a.x1, a.x2, a.x3, b.x4 FROM df a JOIN df3 b ON b.indices = a.x1
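
From PySpark, the query can be run with spark.sql; here is a sketch of that call, together with the equivalent DataFrame-API join:

spark.sql("SELECT a.x1, a.x2, a.x3, b.x4 FROM df a JOIN df3 b ON b.indices = a.x1").show()

# Equivalent join using the DataFrame API
df.join(df3, df["x1"] == df3["indices"]).select("x1", "x2", "x3", "x4").show()

Both produce the x4 column shown in the desired output above (row order from a join is not guaranteed).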

Note: there are also good reference answers on other approaches to adding columns to DataFrames.
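
As a side note, on recent Spark versions the same lookup can be sketched without a join at all, by building a literal array column from the Python list and indexing it with element_at (available since Spark 2.4; passing a Column index through the Python API needs Spark 3.0+):

from pyspark.sql import functions as F

# Build a literal array from the Python list, then index it with x1 + 1
# (element_at uses 1-based indexing)
df.withColumn(
    "x4",
    F.element_at(F.array(*[F.lit(v) for v in arr]), F.col("x1") + 1)
).show()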
