Spark SQL search inside an array for a struct

Question

My data structure is defined approximately as follows:

schema = StructType([
    # ... fields skipped
    StructField("extra_features",
                ArrayType(StructType([
                    StructField("key", StringType(), False),
                    StructField("value", StringType(), True)
                ])),
                nullable=False)
])

Now, I'd like to search for entries in a data frame where a struct {"key": "somekey", "value": "somevalue"} exists in the array column. How do I do this?

Recommended answer

Spark has a function, array_contains, that checks the contents of an ArrayType column, but it does not seem to handle arrays of complex types such as structs. A UDF (user-defined function) can do the check instead:

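from pyspark.sql import Row, SparkSession
from pyspark.sql.types import ArrayType, BooleanType, StringType, StructField, StructType
import pyspark.sql.functions as F

# Reuse the active session (the pyspark shell already provides `spark`)
spark = SparkSession.builder.getOrCreate()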

schema = StructType([StructField("extra_features", ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StringType(), True)])),
    False)])

df = spark.createDataFrame([
    Row([{'key': 'a', 'value': '1'}]),
    Row([{'key': 'b', 'value': '2'}])], schema)

# UDF to check whether {'key': 'a', 'value': '1'} is in an array
# The actual data of a (nested) StructType value is a Row
contains_keyval = F.udf(lambda extra_features: Row(key='a', value='1') in extra_features, BooleanType())

df.where(contains_keyval(df.extra_features)).collect()

This results in:

[Row(extra_features=[Row(key=u'a', value=u'1')])]
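
The UDF above hard-codes the pair it looks for. A minimal sketch of a parameterized variant follows. The names has_keyval and make_contains_udf are hypothetical and not part of PySpark. The sketch assumes array elements support lookup by field name, as Row objects do, and it defers the pyspark import into the factory so the plain-Python predicate can be checked without a Spark installation.

```python
def has_keyval(extra_features, key, value):
    """Return True if any element of the array has the given key and value.

    Elements may be pyspark Row objects or plain dicts; both support
    lookup by field name (element["key"]).
    """
    if extra_features is None:  # a null array contains nothing
        return False
    return any(e["key"] == key and e["value"] == value for e in extra_features)


def make_contains_udf(key, value):
    """Wrap has_keyval in a Spark UDF bound to one (key, value) pair."""
    # Imported here so has_keyval stays usable without pyspark installed
    import pyspark.sql.functions as F
    from pyspark.sql.types import BooleanType

    return F.udf(lambda arr: has_keyval(arr, key, value), BooleanType())


# Usage, with the df defined above (requires a running Spark session):
# df.where(make_contains_udf('a', '1')(df.extra_features)).collect()
```

Keeping the comparison in an ordinary function makes it easy to unit-test separately from Spark, and the factory avoids defining a new lambda for every pair to search for.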

You can also use the contains_keyval UDF to add another column that indicates whether the key-value pair is present:

df.withColumn('contains_it', contains_keyval(df.extra_features)).collect()

Result:

[Row(extra_features=[Row(key=u'a', value=u'1')], contains_it=True),
 Row(extra_features=[Row(key=u'b', value=u'2')], contains_it=False)]
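
On Spark 2.4 and later, the SQL higher-order function exists should be able to express the same check without a Python UDF, which avoids sending each row to a Python worker. The sketch below only builds the SQL predicate string. exists_keyval_sql is a hypothetical helper, not a Spark API, and it deliberately rejects quotes and backslashes instead of attempting to escape them.

```python
def exists_keyval_sql(column, key, value):
    """Build a Spark SQL predicate that is true when `column` (an array of
    structs) has an element with the given key and value."""
    for text in (key, value):
        # Simplistic guard: this sketch does not attempt SQL string escaping
        if "'" in text or "\\" in text:
            raise ValueError("quotes and backslashes are not supported in this sketch")
    return "exists({0}, x -> x.key = '{1}' AND x.value = '{2}')".format(
        column, key, value)


predicate = exists_keyval_sql("extra_features", "a", "1")
print(predicate)

# Usage, with the df and F defined above (assumes Spark 2.4+):
# df.where(predicate).collect()
# df.withColumn('contains_it', F.expr(predicate)).collect()
```

Because value is nullable, the predicate may evaluate to null rather than false for arrays that hold a null value and no match. where() drops such rows either way, but a contains_it column built this way may show null where the UDF version shows False.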
