Sparksql filtering (selecting with where clause) with multiple conditions


Question

Hi, I have the following problem:

numeric.registerTempTable("numeric")


All the values that I want to filter on are literal null strings and not N/A or Null values.

I tried these three options:


  1. numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

  2. numeric_filtered = numeric.filter(numeric['LOW'] != 'null' and numeric['HIGH'] != 'null' and numeric['NORMAL'] != 'null')

  3. sqlContext.sql("SELECT * FROM numeric WHERE LOW != 'null' AND HIGH != 'null' AND NORMAL != 'null'")


Unfortunately, numeric_filtered is always empty. I checked and numeric has data that should be filtered based on these conditions.

Here are some sample values:

low   high  normal
3.5   5.0   null
2.0   14.0  null
null  38.0  null
null  null  null
1.0   null  4.0

Answer

You are using logical conjunction (AND). It means that all columns have to be different than 'null' for a row to be included. Let's illustrate that using the filter version as an example:

numeric = sqlContext.createDataFrame([
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'),  ('null', '38.0', 'null'),
    ('null', 'null', 'null'),  ('1.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

numeric_filtered_2 = numeric_filtered_1.where(
    numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null|   4.0|
## +---+----+------+

numeric_filtered_3 = numeric_filtered_2.where(
    numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()

## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+

All the remaining methods you've tried follow exactly the same pattern. What you need here is a logical disjunction (OR).

from pyspark.sql.functions import col 

numeric_filtered = numeric.where(
    (col('LOW')    != 'null') | 
    (col('NORMAL') != 'null') |
    (col('HIGH')   != 'null'))
numeric_filtered.show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+
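For intuition, the conjunction-vs-disjunction difference can be reproduced without Spark at all. This is a minimal plain-Python sketch over the same sample rows (not part of the original answer): `all(...)` mirrors the chained AND filters and keeps no row, while `any(...)` mirrors the OR version and drops only the all-'null' row.

```python
# Sample rows in the same order as the DataFrame above.
rows = [
    ('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
    ('null', 'null', 'null'), ('1.0', 'null', '4.0'),
]

# Conjunction: every column must differ from 'null' -> no sample row qualifies.
all_non_null = [r for r in rows if all(v != 'null' for v in r)]

# Disjunction: at least one column must differ from 'null' -> only the
# all-'null' row is dropped.
any_non_null = [r for r in rows if any(v != 'null' for v in r)]

print(len(all_non_null))  # 0
print(len(any_non_null))  # 4
```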

Or with raw SQL:

numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
    WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()

## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0|  null|
## | 2.0|14.0|  null|
## |null|38.0|  null|
## | 1.0|null|   4.0|
## +----+----+------+

