Filter by array value in Spark DataFrame
Problem description
I am using Apache Spark 1.5 DataFrames with Elasticsearch, and I am trying to filter by id on a column that contains a list (array) of ids.
For example, the mapping of the Elasticsearch column looks like this:
{
  "people": {
    "properties": {
      "artist": {
        "properties": {
          "id": {
            "index": "not_analyzed",
            "type": "string"
          },
          "name": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
The example data looks like the following:
{
  "people": {
    "artist": [
      {
        "id": "153",
        "name": "Tom"
      },
      {
        "id": "15389",
        "name": "Cok"
      }
    ]
  }
},
{
  "people": {
    "artist": [
      {
        "id": "369",
        "name": "Carl"
      },
      {
        "id": "15389",
        "name": "Cok"
      },
      {
        "id": "698",
        "name": "Sol"
      }
    ]
  }
}
In Spark I tried this:
val peopleId = 152
val dataFrame = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .load("index/type")
dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
  .select("people.artist.id")
This returned every id that contains 152, for example 1523 and 152978, not only the ids equal to 152.
Then I tried:
dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
  .select("people.artist.id")
This returns nothing. I understand why: people.artist.id is an array, so it never equals a single value.
Can anyone tell me how to filter when the column holds a list of ids?
Solution
In Spark 1.5+ you can use the array_contains function:
import org.apache.spark.sql.functions.array_contains
df.where(array_contains($"people.artist.id", "153"))
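To try the array_contains filter without an Elasticsearch cluster, the sketch below rebuilds the two sample documents from the question as a local DataFrame. It is only an illustration under assumptions: it uses SparkSession, which requires Spark 2.0 or later rather than the 1.5 SQLContext from the question, and the names ArrayContainsSketch, Artist, People, Doc, sampleData and withArtistId are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

object ArrayContainsSketch {
  // Hypothetical case classes mirroring the people.artist mapping in the question.
  case class Artist(id: String, name: String)
  case class People(artist: Seq[Artist])
  case class Doc(people: People)

  // Builds the two sample documents from the question as a local DataFrame.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Doc(People(Seq(Artist("153", "Tom"), Artist("15389", "Cok")))),
      Doc(People(Seq(Artist("369", "Carl"), Artist("15389", "Cok"), Artist("698", "Sol"))))
    ).toDF()
  }

  // Keeps only rows whose people.artist.id array has an element equal to `id`.
  def withArtistId(df: DataFrame, id: String): DataFrame =
    df.where(array_contains(col("people.artist.id"), id))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-contains-sketch")
      .getOrCreate()
    val df = sampleData(spark)

    println(withArtistId(df, "153").count())   // 1: only the first document
    println(withArtistId(df, "15389").count()) // 2: both documents
    println(withArtistId(df, "152").count())   // 0: no substring matches

    spark.stop()
  }
}
```

Selecting people.artist.id on an array of structs yields an array of the id values, which is why array_contains can compare each element against the wanted id exactly.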
If you use an earlier version, you can try a UDF like this:
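import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, udf}
// True when any struct in the artist array has an "id" field equal to v.
val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))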
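The difference between the two outcomes can be sketched with plain Scala collections, no Spark involved. The example is only an illustration: the object name MembershipVsSubstring, the sample ids and both helper functions are hypothetical. substringMatch mimics the behaviour reported in the question, where 1523 and 152978 matched 152; exactMatch applies the same element-equality predicate as the UDF above.

```scala
object MembershipVsSubstring {
  // Hypothetical stand-in for one row's people.artist.id array.
  val ids: Seq[String] = Seq("1523", "152978", "369")

  // Substring-style matching: true if any id merely contains the wanted text.
  def substringMatch(ids: Seq[String], wanted: String): Boolean =
    ids.exists(_.contains(wanted))

  // Exact element matching: true only if some id equals the wanted value.
  def exactMatch(ids: Seq[String], wanted: String): Boolean =
    ids.exists(_ == wanted)

  def main(args: Array[String]): Unit = {
    println(substringMatch(ids, "152")) // true: "1523" contains "152"
    println(exactMatch(ids, "152"))     // false: no id equals "152"
    println(exactMatch(ids, "369"))     // true: "369" is an element
  }
}
```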