PySpark: replace all values in a DataFrame with other values
Question
I have 500 columns in my PySpark DataFrame. Some are of string type, some are int, and some are boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0.
For the string columns I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers.
c1  | c2  | c3     | c4  | c5 ... | c500
yes | yes | passed | 45  | ...
No  | Yes | failed | 452 | ...
Yes | No  | None   | 32  | ...
When I do this:
df.replace('yes', 1)
I get the following error:
ValueError: Mixed type replacements are not supported
Answer
"For string I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers"
First, import when and lit
from pyspark.sql.functions import when, lit
Assuming your DataFrame has these columns
# Reconstructing my DataFrame based on your assumptions
# cols are Columns in the DataFrame
cols = ['name', 'age', 'col_with_string']
# Similarly the values
vals = [
('James', 18, 'passed'),
('Smith', 15, 'passed'),
('Albie', 32, 'failed'),
('Stacy', 33, None),
('Morgan', 11, None),
('Dwight', 12, None),
('Steve', 16, 'passed'),
('Shroud', 22, 'passed'),
('Faze', 11,'failed'),
('Simple', 13, None)
]
# This will create a DataFrame using 'cols' and 'vals'
# spark is an object of SparkSession
df = spark.createDataFrame(vals, cols)
# We have the following DataFrame
df.show()
+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|           null|
|Morgan| 11|           null|
|Dwight| 12|           null|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|           null|
+------+---+---------------+
You can use:
- withColumn() - to specify the column you want to use.
- isNull() - a Column method that evaluates to true if and only if the value is null.
- lit() - creates a column from a literal value.
- when(), otherwise() - used to check a condition on a column and choose a value accordingly.
The null values can then be replaced with the string '0' (the column is of string type, so the replacement is a string too):
df = df.withColumn('col_with_string', when(df.col_with_string.isNull(),
lit('0')).otherwise(df.col_with_string))
# We have replaced nulls with a '0'
df.show()
+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|              0|
|Morgan| 11|              0|
|Dwight| 12|              0|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|              0|
+------+---+---------------+
Part 1 of your question: Yes/No boolean values. You mentioned that there are 100 boolean columns. For this, I generally reconstruct the table with updated values, or create a UDF that returns 1 or 0 for Yes or No.
I am adding two more columns, can_vote and can_lotto, to the DataFrame (df)
from pyspark.sql.functions import col
df = df.withColumn("can_vote", col('age') >= 18)
df = df.withColumn("can_lotto", col('age') > 16)
# Updated DataFrame will be
df.show()
+------+---+---------------+--------+---------+
|  name|age|col_with_string|can_vote|can_lotto|
+------+---+---------------+--------+---------+
| James| 18|         passed|    true|     true|
| Smith| 15|         passed|   false|    false|
| Albie| 32|         failed|    true|     true|
| Stacy| 33|              0|    true|     true|
|Morgan| 11|              0|   false|    false|
|Dwight| 12|              0|   false|    false|
| Steve| 16|         passed|   false|    false|
|Shroud| 22|         passed|    true|     true|
|  Faze| 11|         failed|   false|    false|
|Simple| 13|              0|   false|    false|
+------+---+---------------+--------+---------+
Assuming you have similar columns to can_vote and can_lotto (boolean values being Yes/No)
You can use the following line of code to fetch the columns in the DataFrame having boolean type
col_with_bool = [item[0] for item in df.dtypes if item[1].startswith('boolean')]
This returns a list:
['can_vote', 'can_lotto']
You can create a UDF and iterate over each column in this list, setting each value to 1 (Yes) or 0 (No).
For reference, see the following links:
- isNull(): https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/sources/IsNull.html
- lit, when: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html