PySpark DataFrames - filtering using comparisons between columns of different types


Question

Suppose you have a dataframe with columns of various types (string, double...) and a special value "miss" that represents "missing value" in string-typed columns.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame([
    [1, 'miss'],
    [2, 'x'],
    [None, 'y']
], columns=['intcol', 'strcol'])

df = spark.createDataFrame(data=pdf)
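
Note that because Pandas stores the missing intcol entry as NaN, Spark infers intcol as double rather than integer; you can confirm this from the schema:

df.printSchema()
# root
#  |-- intcol: double (nullable = true)
#  |-- strcol: string (nullable = true)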

I am trying to count the number of non-missing values for each column, using filtering like this:

col = df['strcol']
df.filter(col.isNotNull() & (col != 'miss')).show()

This works for the string column:

+------+------+
|intcol|strcol|
+------+------+
|   2.0|     x|
|   NaN|     y|
+------+------+

However, for the numeric column, it filters out all the rows:

col = df['intcol']
df.filter(col.isNotNull() & (col != 'miss')).show()

+------+------+
|intcol|strcol|
+------+------+
+------+------+

It seems like this is because the cross-type comparison of the numeric column with a string value results in all-null values:

df.select(df['intcol'] != 'miss').show()

+---------------------+
|(NOT (intcol = miss))|
+---------------------+
|                 null|
|                 null|
|                 null|
+---------------------+

I find this a bit unexpected (e.g. 1 != '' is True, not null, in "normal" Python).

My question is really several questions:

  • Why does the cross-type comparison result in nulls?
  • What is the best way to test for equality/non-equality across different types in the "expected way"? Or (in my case) do I need to include separate logic that switches based on the type of the column?
  • It seems like df.filter(~df['intcol'].isin(['miss'])) does the job, but I wonder if that is less efficient?

Answer

Let's start with the why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to operands of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. Numeric types generally have higher precedence, therefore (following the execution plan from df.select(df['intcol'] != 'miss').explain(True)):

== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

is rewritten as

== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

where 'miss' is CAST to double and later converted to NULL:

== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false

because the cast of 'miss' to double is undefined.
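
You can observe this directly; with Spark's default (non-ANSI) cast behavior, a failed cast returns NULL rather than raising an error:

spark.sql("SELECT CAST('miss' AS double) AS casted").show()
# +------+
# |casted|
# +------+
# |  null|
# +------+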

Since equality with NULL is undefined as well (see Difference between === null and isNull in Spark DataFrame), the filter yields an empty result.
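
This is standard SQL three-valued logic: any ordinary comparison against NULL evaluates to NULL, and filter only keeps rows where the predicate is true. Only the null-safe operator <=> always yields true or false:

spark.sql("SELECT NULL = 'miss' AS eq, NULL <=> 'miss' AS null_safe_eq").show()
# eq comes back NULL, while null_safe_eq is false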

Now, how to address this? Both explicit casting:

df.filter(df['intcol'].cast("string") != 'miss')

and null-safe equality:

df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))

should do the trick.
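
Putting this together for the original counting task, here is a minimal sketch that switches on the column type (it assumes every non-string column is numeric, so that isnan applies):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Count non-missing values per column: string columns exclude NULL and
# the 'miss' sentinel; numeric columns exclude NULL and NaN.
counts = {}
for field in df.schema.fields:
    c = F.col(field.name)
    if isinstance(field.dataType, StringType):
        cond = c.isNotNull() & (c != 'miss')
    else:
        cond = c.isNotNull() & ~F.isnan(c)
    counts[field.name] = df.filter(cond).count()

print(counts)  # {'intcol': 2, 'strcol': 2}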

Also note that NaN values are not NULL, and the conversion via Pandas is lossy (see Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?).
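
On the example frame this matters: the missing intcol entry arrives as NaN, which isNull() does not catch, so a NaN check is needed as well:

from pyspark.sql import functions as F

df.select(
    F.col('intcol').isNull().alias('is_null'),
    F.isnan('intcol').alias('is_nan')
).show()
# is_null is false for all three rows; is_nan is true only for the NaN row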
