如何检测Spark DataFrame是否具有列 [英] How do I detect if a Spark DataFrame has a column

查看:1015
本文介绍了如何检测Spark DataFrame是否具有列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我在Spark SQL中从JSON文件创建DataFrame时,如何在调用.select

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select

示例JSON模式:

{
  "a": {
    "b": 1,
    "c": 2
  }
}

这就是我想要做的:

potential_columns = Seq("b", "c", "d")
df = sqlContext.read.json(filename)
potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))

但是我找不到hasColumn的良好功能.我得到的最接近的结果是测试该列是否在这个有点尴尬的数组中:

but I can't find a good function for hasColumn. The closest I've gotten is to test if the column is in this somewhat awkward array:

scala> df.select("a.*").columns
res17: Array[String] = Array(b, c)

推荐答案

只需假定它存在,并使用Try使其失败.简单明了,支持任意嵌套:

Just assume it exists and let it fail with Try. Plain and simple and supports an arbitrary nesting:

import scala.util.Try
import org.apache.spark.sql.DataFrame

def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess

val df = sqlContext.read.json(sc.parallelize(
  """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil))

hasColumn(df, "foobar")
// Boolean = false

hasColumn(df, "foo")
// Boolean = true

hasColumn(df, "foo.bar")
// Boolean = true

hasColumn(df, "foo.bar.foobar")
// Boolean = true

hasColumn(df, "foo.bar.foobaz")
// Boolean = false

或更简单:

val columns = Seq(
  "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz")

columns.flatMap(c => Try(df(c)).toOption)
// Seq[org.apache.spark.sql.Column] = List(
//   foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)

等效于Python:

from pyspark.sql.utils import AnalysisException
from pyspark.sql import Row


def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False

df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF()

has_column(df, "foobar")
## False

has_column(df, "foo")
## True

has_column(df, "foo.bar")
## True

has_column(df, "foo.bar.foobar")
## True

has_column(df, "foo.bar.foobaz")
## False

这篇关于如何检测Spark DataFrame是否具有列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆