Sparksql filtering (selecting with where clause) with multiple conditions

Question

Hi, I have the following issue:

numeric.registerTempTable("numeric"). 

All the values that I want to filter on are literal null strings and not N/A or Null values.

I tried these three options:

  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")

Unfortunately, numeric_filtered is always empty. I checked and numeric has data that should be filtered based on these conditions.

Here are some sample values:

Low   High  Normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
null  null  null
1.0   null  4.0

Recommended answer

You are using logical conjunction (AND). It means that all columns have to be different from 'null' for a row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All remaining methods you've tried follow exactly the same schema. What you need here is a logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
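The difference between the two predicates can also be checked without a Spark session. The following is only a minimal plain-Python sketch, not Spark code: it reuses the sample rows from above, and the names `rows`, `not_null_flags`, `kept_with_and` and `kept_with_or` are hypothetical helpers introduced for illustration.

```python
from functools import reduce
from operator import and_, or_

# Sample rows from the answer above. Every value is a string,
# and 'null' is a literal string, not a missing value.
rows = [
    ('3.5,', '5.0', 'null'),
    ('2.0', '14.0', 'null'),
    ('null', '38.0', 'null'),
    ('null', 'null', 'null'),
    ('1.0', 'null', '4.0'),
]


def not_null_flags(row):
    """Return one boolean per column: True when the value is not the string 'null'."""
    return [value != 'null' for value in row]


# Conjunction (AND): a row survives only if every column differs from 'null'.
kept_with_and = [row for row in rows if reduce(and_, not_null_flags(row))]

# Disjunction (OR): a row survives if at least one column differs from 'null'.
kept_with_or = [row for row in rows if reduce(or_, not_null_flags(row))]

print(len(kept_with_and))  # 0
print(len(kept_with_or))   # 4
```

No sample row has all three columns populated, so the AND version keeps nothing, while the OR version drops only the row where every column is 'null'. The same `reduce(or_, ...)` pattern should carry over to a list of PySpark column expressions such as `col(c) != 'null'`, which can be convenient when there are many columns to check.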

See also: Pyspark: multiple conditions in when clause
