Filter by array value in Spark DataFrame

Question

I am using Apache Spark 1.5 DataFrames with Elasticsearch, and I am trying to filter by id on a column that contains a list (array) of ids.

For example, the Elasticsearch mapping of the column looks like this:

    {
        "people": {
            "properties": {
                "artist": {
                    "properties": {
                        "id": {
                            "index": "not_analyzed",
                            "type": "string"
                        },
                        "name": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }

The sample data looks like this:

{
    "people": {
        "artist": [
            {
                "id": "153",
                "name": "Tom"
            },
            {
                "id": "15389",
                "name": "Cok"
            }
        ]
    }
},
{
    "people": {
        "artist": [
            {
                "id": "369",
                "name": "Carl"
            },
            {
                "id": "15389",
                "name": "Cok"
            },
            {
                "id": "698",
                "name": "Sol"
            }
        ]
    }
}

In Spark I tried this:

val peopleId  = 152
val dataFrame = sqlContext.read
     .format("org.elasticsearch.spark.sql")
     .load("index/type")

dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
    .select("people.artist.id")

I got every id that contains 152, for example 1523 and 152978, rather than only the ids equal to 152.

Then I tried:

dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
    .select("people.artist.id")

I get an empty result, and I understand why: people.artist.id is an array of ids.

Can anyone tell me how to filter when I have a list of ids?

Recommended answer

In Spark 1.5+ you can use the array_contains function:

df.where(array_contains($"people.artist.id", "153"))
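
For anyone who wants to try this without an Elasticsearch index, here is a minimal self-contained sketch. It assumes Spark 2.x or later (it uses SparkSession instead of the sqlContext from the question) and a local master. The case classes Artist, People and Doc, the object ArrayContainsSketch, and its helper methods are hypothetical names chosen for illustration; the in-memory rows mirror the sample data from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

// Hypothetical case classes mirroring the sample documents in the question.
case class Artist(id: String, name: String)
case class People(artist: Seq[Artist])
case class Doc(people: People)

object ArrayContainsSketch {
  // Builds a small in-memory DataFrame instead of reading from Elasticsearch.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Doc(People(Seq(Artist("153", "Tom"), Artist("15389", "Cok")))),
      Doc(People(Seq(Artist("369", "Carl"), Artist("15389", "Cok"), Artist("698", "Sol"))))
    ).toDF()
  }

  // people.artist is an array of structs, so people.artist.id resolves to an
  // array of strings; array_contains compares whole elements, not substrings.
  def filterByArtistId(df: DataFrame, id: String): DataFrame =
    df.where(array_contains(col("people.artist.id"), id))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration only
      .appName("array-contains-sketch")
      .getOrCreate()

    // The ids are stored as strings, so the lookup value is a string too.
    filterByArtistId(sampleData(spark), "153")
      .select("people.artist.id")
      .show(false)

    spark.stop()
  }
}
```

With this sample data, only the first document should remain: "15389" is not matched by "153" because each array element is compared for equality as a whole.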

If you use an earlier version, you can try a UDF like this:

val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))
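
The UDF variant can be exercised the same way. The sketch below is an illustration under assumptions, not a drop-in replacement: it uses SparkSession (Spark 2.x or later) for the local setup even though the UDF approach is aimed at older versions, and the case classes, the object ContainsIdUdfSketch, and its helpers are hypothetical names. It also adds a null check on the array, which the snippet above does not have.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical case classes mirroring the sample documents in the question.
case class Artist(id: String, name: String)
case class People(artist: Seq[Artist])
case class Doc(people: People)

object ContainsIdUdfSketch {
  // True when at least one struct in the array has an "id" field equal to v.
  // A null array is treated as "no match" (an added safeguard).
  val containsId = udf((rs: Seq[Row], v: String) =>
    rs != null && rs.exists(_.getAs[String]("id") == v))

  // Builds a small in-memory DataFrame instead of reading from Elasticsearch.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Doc(People(Seq(Artist("153", "Tom"), Artist("15389", "Cok")))),
      Doc(People(Seq(Artist("369", "Carl"), Artist("15389", "Cok"), Artist("698", "Sol"))))
    ).toDF()
  }

  // Passes the whole array of structs to the UDF, plus the id as a literal.
  def filterByArtistId(df: DataFrame, id: String): DataFrame =
    df.where(containsId(col("people.artist"), lit(id)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for demonstration only
      .appName("contains-id-udf-sketch")
      .getOrCreate()

    filterByArtistId(sampleData(spark), "153")
      .select("people.artist.id")
      .show(false)

    spark.stop()
  }
}
```

Where array_contains is available it is probably the better choice, since a UDF is opaque to the query optimizer.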
