Collecting the result of PySpark Dataframe filter into a variable


Question

I am using the PySpark dataframe. My dataset contains three attributes, id, name and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get unique id of the row I want to delete

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()

The output I get is: [Row(id='382')]

I am wondering how can I use id to delete a row. Also, how can i replace certain value in a dataframe with another? For example, replacing all values == "Bruce" with "John"

Answer

From the docs for pyspark.sql.DataFrame.collect(), the function:


Returns all the records as a list of Row.

The fields in a pyspark.sql.Row can be accessed like dictionary values.

Using your example:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]

You can access the id field by doing:

id_vals = [r['id'] for r in ID]
#['382']
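If only one match is expected, the first Row can also be indexed directly (a small aside, not from the original answer; it assumes the filter returned at least one row):

single_id = ID[0]['id']
#'382'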

But looking up one value at a time is generally a bad use for spark DataFrames. You should think about your end goal, and see if there's a better way to do it.
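For instance, if the end goal is just to drop the rows where name is "Bruce", a minimal sketch (assuming the same df; note that rows with a null name would also be dropped by this comparison) filters them out directly instead of collecting ids back to the driver:

# Keep only the rows whose name is not "Bruce"; no collect() round trip needed
df = df.filter(df["name"] != "Bruce")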

Edit

Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is by using pyspark.sql.functions.when().

This function takes a boolean column expression as the first argument. I am using f.col("name") == "Bruce". The second argument is what should be returned if the boolean expression is True. For this example, I am using f.lit(replacement_value).

For example:

import pyspark.sql.functions as f

replacement_value = "Wayne"

# Where name equals "Bruce", use the replacement value; otherwise keep the original name
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)
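
As a side note that is not part of the original answer, pyspark.sql.DataFrame.replace() can do the same literal substitution more compactly; a minimal sketch, assuming only exact matches of "Bruce" in the name column should change:

# Replace the literal "Bruce" with "John" in the name column only
df = df.replace("Bruce", "John", subset=["name"])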
