Filter by array value in Spark DataFrame


Problem description


I am using Apache Spark 1.5 DataFrames with Elasticsearch, and I am trying to filter on an id from a column that contains a list (array) of ids.

For example, the Elasticsearch mapping for the column looks like this:

    {
        "people": {
            "properties": {
                "artist": {
                    "properties": {
                        "id": {
                            "index": "not_analyzed",
                            "type": "string"
                        },
                        "name": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }

The example data looks like the following:

{
    "people": {
        "artist": [
            {
                "id": "153",
                "name": "Tom"
            },
            {
                "id": "15389",
                "name": "Cok"
            }
        ]
    }
},
{
    "people": {
        "artist": [
            {
                "id": "369",
                "name": "Carl"
            },
            {
                "id": "15389",
                "name": "Cok"
            },
            {
                "id": "698",
                "name": "Sol"
            }
        ]
    }
}

In Spark I tried this:

val peopleId  = 152
val dataFrame = sqlContext.read
     .format("org.elasticsearch.spark.sql")
     .load("index/type")

dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
    .select("people.artist.id")

This returned every id that contains 152 as a substring, for example 1523 and 152978, not only ids equal to 152.

Then I tried:

dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
    .select("people.artist.id")

This returns an empty result. I understand why: people.artist.id is an array of ids, so comparing it for equality with a single value never matches.
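The difference between the two attempts can be sketched with plain Scala collections. This is only an analogy for the behaviour reported above, not a description of how Spark evaluates the column expressions; the object name, the helper names and the sample ids are hypothetical.

```scala
object MatchSemantics {
  // Hypothetical stand-in for the people.artist.id array of one row.
  val ids: Seq[String] = Seq("1523", "152978", "15389")

  // Substring test: true as soon as any id merely contains the text.
  def substringMatch(ids: Seq[String], wanted: String): Boolean =
    ids.exists(_.contains(wanted))

  // Membership test: true only if some element equals the wanted id.
  def exactMatch(ids: Seq[String], wanted: String): Boolean =
    ids.contains(wanted)

  // Whole-value comparison: a sequence is never equal to a single string.
  def wholeValueEquals(ids: Seq[String], wanted: String): Boolean =
    ids.equals(wanted)
}
```

What the question needs is the membership test: match a row only when one element of the array equals the id.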

Can anyone tell me how to filter when I have a list of ids?

Solution

In Spark 1.5+ you can use the array_contains function (from org.apache.spark.sql.functions):

df.where(array_contains($"people.artist.id", "153"))

If you use an earlier version, you can try a UDF like this:

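import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, udf}

// True when any struct in the artist array has an "id" field equal to v.
val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))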
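
To try the array_contains filter without an Elasticsearch cluster, a small local DataFrame with the same nested shape can stand in for the index. The following is a minimal sketch, not the original poster's setup: the case classes, the object name and the helper withArtistId are hypothetical, and it assumes Spark 2.0 or later so that SparkSession is available (on 1.5 the same where clause applies to a DataFrame obtained from sqlContext).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

// Hypothetical case classes mirroring the sample documents in the question.
case class Artist(id: String, name: String)
case class People(artist: Seq[Artist])
case class Doc(people: People)

object ArrayContainsExample {
  // Keep rows whose people.artist.id array has an element equal to `wanted`.
  def withArtistId(df: DataFrame, wanted: String): DataFrame =
    df.where(array_contains(col("people.artist.id"), wanted))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is used here instead of the Elasticsearch source.
    val spark = SparkSession.builder()
      .appName("array-contains-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Doc(People(Seq(Artist("153", "Tom"), Artist("15389", "Cok")))),
      Doc(People(Seq(Artist("369", "Carl"), Artist("15389", "Cok"), Artist("698", "Sol"))))
    ).toDF()

    // Exact element match on "153".
    withArtistId(df, "153").select("people.artist.id").show(false)
    // "152" is only a prefix of other ids, so it is not an exact element match.
    withArtistId(df, "152").select("people.artist.id").show(false)

    spark.stop()
  }
}
```

Because the ids are mapped as strings, the value passed to array_contains is a string as well; passing an integer would compare against a different element type.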
