Collecting the result of PySpark Dataframe filter into a variable
Question
I am using a PySpark dataframe. My dataset contains three attributes: id, name, and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get the unique id of the row I want to delete:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
The output I get is: [Row(id='382')]
I am wondering how I can use this id to delete a row. Also, how can I replace certain values in a dataframe with another? For example, replacing all values == "Bruce" with "John".
Answer
From the docs for pyspark.sql.DataFrame.collect(), the function:
Returns all the records as a list of Row.
The fields in a pyspark.sql.Row can be accessed like dictionary values.
Using your example:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]
You can access the id field by doing:
id_vals = [r['id'] for r in ID]
#['382']
However, looking up values one at a time is generally a poor use of Spark DataFrames. Think about your end goal, and see if there's a better way to achieve it.
Edit
Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is with pyspark.sql.functions.when().
This function takes a boolean column expression as its first argument; here that is f.col("name") == "Bruce". The second argument is what should be returned when the boolean expression is True; for this example, that is f.lit(replacement_value).
For example:
import pyspark.sql.functions as f
replacement_value = "Wayne"
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)