Spark SQL's where clause excludes null values
Question
I am trying to run queries on Apache Spark SQL. The first query works fine, but the second query removes null values as well.
Code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def main(args: Array[String]) {
  val sc = new SparkContext("local[*]", "Spark")
  val sqlContext = new SQLContext(sc)
  val pageViewsDF = getDataframe(sc, sqlContext)
  println("RUNNING SQL QUERIES ")
  sqlContext.sql("select name , count(*) from pageviews_by_second group by name").show(10)
  sqlContext.sql("select name , count(*) from pageviews_by_second where name not in (\"Rose\") group by name").show(10)
}

def getDataframe(sc: SparkContext, sqlContext: SQLContext): DataFrame = {
  Logger.getLogger("org").setLevel(Level.OFF)
  Logger.getLogger("akka").setLevel(Level.OFF)

  val dataArray = List(
    List("David", null),
    List("David", null),
    List("Charlie", "23"),
    List("Rose", null),
    List("Ben", null),
    List("Harry", "43"),
    List(null, "25"),
    List(null, "21"),
    List("David", "15"),
    List("Rose", null),
    List("Alan", "26"))

  val separator = ","
  // Create an RDD from the in-memory data
  val dataRDD = sc.parallelize(dataArray)
  // The schema is encoded in a string
  val header = "name,age"
  // Generate the schema based on the schema string
  val schema =
    StructType(
      header.split(separator).map { fieldName =>
        StructField(fieldName, StringType, true)
      })
  // Convert each record to a Row
  val rowRDD = dataRDD.map(p => Row(p(0), p(1)))
  // Apply the schema to the RDD and register the result as a temp table
  val df = sqlContext.createDataFrame(rowRDD, schema)
  df.registerTempTable("pageviews_by_second")
  df
}
The result of the first query:
+-------+---+
| name|_c1|
+-------+---+
| Alan| 1|
| Ben| 1|
| David| 3|
|Charlie| 1|
| Rose| 2|
| Harry| 1|
| null| 2|
+-------+---+
The output of the second query:
+-------+---+
| name|_c1|
+-------+---+
| Alan| 1|
| Ben| 1|
| David| 3|
|Charlie| 1|
| Harry| 1|
+-------+---+
In the second query I am excluding only "Rose", but "null" is getting excluded as well.
If my query is wrong, please help me with the correct query.
Answer
This happens because NULL in SQL is equivalent to "unknown". It means that any comparison with NULL, other than IS NULL / IS NOT NULL, is undefined and returns NULL.
case class Record(id: Integer, value: String)

// toDF needs the SQLContext implicits when this runs outside spark-shell
import sqlContext.implicits._

val df = sc.parallelize(Seq(Record(1, "foo"), Record(2, null))).toDF
df.registerTempTable("df")
sqlContext.sql("""SELECT value = "foo" FROM df""").show
// +----+
// | _c0|
// +----+
// |true|
// |null|
// +----+
sqlContext.sql("""SELECT value != "foo" FROM df""").show
// +-----+
// | _c0|
// +-----+
// |false|
// | null|
// +-----+
Because of that, IN / NOT IN is undefined as well:
sqlContext.sql("""SELECT value IN ("foo", "bar") FROM df""").show
// +----+
// | _c0|
// +----+
// |true|
// |null|
// +----+
This is standard SQL behavior, and a system that correctly implements the SQL standard should behave the same way. If you want to filter and still keep NULLs, you have to make it explicit:
sqlContext.sql(
"""SELECT value IN ("foo", "bar") OR value IS NULL FROM df""").show
// +----+
// | _c0|
// +----+
// |true|
// |true|
// +----+
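Applying the same fix to the original query keeps the NULL names while still excluding "Rose". The query below is a sketch based on the answer above, reusing the table and column names from the question; it has not been run against the original data:
// Exclude "Rose" but keep the rows whose name is NULL
sqlContext.sql("select name , count(*) from pageviews_by_second where name not in (\"Rose\") or name is null group by name").show(10)
With the DataFrame API the equivalent filter would be something like df.filter(!$"name".isin("Rose") || $"name".isNull), assuming Spark 1.5 or later (where isin is available) and that the sqlContext implicits are imported.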