PySpark DataFrames - filtering using comparisons between columns of different types

Question

Suppose you have a dataframe with columns of various types (string, double...) and a special value "miss" that represents "missing value" in string-typed columns.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

pdf = pd.DataFrame([
    [1, 'miss'],
    [2, 'x'],
    [None, 'y']
], columns=['intcol', 'strcol'])

df = spark.createDataFrame(data=pdf)

I am trying to count the number of non-missing values for each column, using filtering like this:

col = df['strcol']
df.filter(col.isNotNull() & (col != 'miss')).show()

which works for the string column:

+------+------+
|intcol|strcol|
+------+------+
|   2.0|     x|
|   NaN|     y|
+------+------+

However, for the numeric column, it filters out all the rows:

col = df['intcol']
df.filter(col.isNotNull() & (col != 'miss')).show()

+------+------+
|intcol|strcol|
+------+------+
+------+------+

It seems like this is because the cross-type comparison of the numeric column with string value results in all-null values:

df.select(df['intcol'] != 'miss').show()

+---------------------+
|(NOT (intcol = miss))|
+---------------------+
|                 null|
|                 null|
|                 null|
+---------------------+

Which I find a bit unexpected (e.g. 1 != '' is True, not null, in "normal" Python).

My question is really several questions:

  • Why does the cross-type comparison result in nulls?
  • What is the best way to test for equality/inequality across different types in the "expected" way? Or (in my case) do I need to include separate logic that switches based on the type of the column?
  • It seems like df.filter(~df['intcol'].isin(['miss'])) does the job, but I wonder whether that is less efficient?

Answer

Let's start with why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to objects of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. Numeric types generally have higher precedence, therefore (following the execution plan from df.select(df['intcol'] != 'miss').explain(True)):

== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

is rewritten as:

== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

where 'miss' is CAST to double, and the whole comparison is later constant-folded to NULL:

== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false

as cast with this operand is undefined.

Since equality with NULL is undefined as well (see "Difference between === null and isNull in Spark DataFrame"), filter yields an empty result.

Now how to address that. Both explicit casting:

df.filter(df['intcol'].cast("string") != 'miss')

and null-safe equality:

df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))

should do the trick.

Also please note that NaN values are not NULL, and conversion via Pandas is lossy (see "Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?").
