PySpark: replace all values in a DataFrame with other values

Views: 30 | Posted: 2021/11/14 22:39:15 | Tags: python, pyspark, pyspark-sql


Question

I have 500 columns in my PySpark DataFrame. Some are of string type, some int, and some boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0.

For the string columns I have three values: passed, failed, and null. How do I replace those nulls with 0? fillna(0) works only with integers.

 c1| c2 |    c3 |c4|c5..... |c500
yes| yes|passed |45....
No | Yes|failed |452....
Yes|No  |None   |32............

When I do this:

df.replace(yes,1)

I get the following error:

ValueError: Mixed type replacements are not supported

Recommended answer

"For string I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers"

First, import when and lit:

from pyspark.sql.functions import when, lit

Assume your DataFrame has these columns:

# Reconstructing my DataFrame based on your assumptions
# cols are Columns in the DataFrame
cols = ['name', 'age', 'col_with_string']

# Similarly the values
vals = [
     ('James', 18, 'passed'),
     ('Smith', 15, 'passed'),
     ('Albie', 32, 'failed'),
     ('Stacy', 33, None),
     ('Morgan', 11, None),
     ('Dwight', 12, None),
     ('Steve', 16, 'passed'), 
     ('Shroud', 22, 'passed'),
     ('Faze', 11,'failed'),
     ('Simple', 13, None)
]

# This will create a DataFrame using 'cols' and 'vals'
# spark is an object of SparkSession
df = spark.createDataFrame(vals, cols)

# We have the following DataFrame
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|           null|
|Morgan| 11|           null|
|Dwight| 12|           null|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|           null|
+------+---+---------------+

You can use:

withColumn() - adds a column, or replaces the existing column of the same name.
isNull() - a Column method that evaluates to true if and only if the value is null.
lit() - creates a column from a literal value.
when(), otherwise() - evaluate a condition against the column and pick the resulting value.

I can replace the null values with '0' (a string literal, so the column keeps its string type):

df = df.withColumn('col_with_string', when(df.col_with_string.isNull(), 
lit('0')).otherwise(df.col_with_string))

# We have replaced nulls with a '0'
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|              0|
|Morgan| 11|              0|
|Dwight| 12|              0|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|              0|
+------+---+---------------+

Part 1 of your question, the Yes/No boolean values: you mentioned that there are 100 boolean columns. For this, I generally reconstruct the table with updated values, or create a UDF that returns 1 or 0 for Yes or No.

I am adding two more columns, can_vote and can_lotto, to the DataFrame (df):

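# col must be imported as well; the earlier import only covers when and lit
from pyspark.sql.functions import col
df = df.withColumn("can_vote", col('age') >= 18)
df = df.withColumn("can_lotto", col('age') > 16)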

# Updated DataFrame will be
df.show()

+------+---+---------------+--------+---------+
|  name|age|col_with_string|can_vote|can_lotto|
+------+---+---------------+--------+---------+
| James| 18|         passed|    true|     true|
| Smith| 15|         passed|   false|    false|
| Albie| 32|         failed|    true|     true|
| Stacy| 33|              0|    true|     true|
|Morgan| 11|              0|   false|    false|
|Dwight| 12|              0|   false|    false|
| Steve| 16|         passed|   false|    false|
|Shroud| 22|         passed|    true|     true|
|  Faze| 11|         failed|   false|    false|
|Simple| 13|              0|   false|    false|
+------+---+---------------+--------+---------+

Assume you have columns similar to can_vote and can_lotto (boolean values being Yes/No).

You can use the following line of code to fetch the columns in the DataFrame that have boolean type:

col_with_bool = [item[0] for item in df.dtypes if item[1].startswith('boolean')]

This returns a list:

['can_vote', 'can_lotto']

You can create a UDF and iterate over each column in this list, replacing each column's values with 1 (Yes) or 0 (No).

For reference, see the following links:

isNull(): https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/sources/IsNull.html
lit, when: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
