Filtering out nested array entries in a DataFrame (JSON) by passing in a list of values to match against
Problem description
I read a huge file into a DataFrame; each line of the file holds a JSON object like the following:
{
"userId": "12345",
"vars": {
"test_group": "group1",
"brand": "xband"
},
"modules": [
{
"id": "New"
},
{
"id": "Default"
},
{
"id": "BestValue"
},
{
"id": "Rating"
},
{
"id": "DeliveryMin"
},
{
"id": "Distance"
}
]
}
I would like to pass a list of module ids to a method and have it remove every entry of modules whose id does not equal any value in that list.
Do you have a solution?
Recommended answer
The schema of the DataFrame says that modules is a collection of struct[String], where each struct holds a single id field. For the current requirement you will first have to convert the Array[struct[String]] to an Array[String]:
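import org.apache.spark.sql.functions.{collect_list, explode}
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark
val finaldf = df.withColumn("modules", explode($"modules.id"))
  .groupBy("userId", "vars").agg(collect_list("modules").as("modules"))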
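To see what this conversion does to the column type, here is a minimal, self-contained sketch. It assumes a local SparkSession, Spark 2.2 or later (for reading JSON from a Dataset[String]), and a shortened copy of the sample record from the question; the object name FlattenModulesSketch is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, explode}

object FlattenModulesSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for trying the snippet out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-modules-sketch")
      .getOrCreate()
    import spark.implicits._

    // One JSON object per line, shaped like the sample in the question.
    val lines = Seq(
      """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"}]}"""
    )
    val df = spark.read.json(lines.toDS())

    // Before: modules is an array of structs, each with a string field id.
    df.printSchema()

    // $"modules.id" pulls the id out of every struct; explode produces one row
    // per id, and collect_list gathers the ids back into one array per user.
    val finaldf = df.withColumn("modules", explode($"modules.id"))
      .groupBy("userId", "vars").agg(collect_list("modules").as("modules"))

    // After: modules is an array of plain strings.
    finaldf.printSchema()

    spark.stop()
  }
}
```

Note that explode drops rows whose modules array is empty or null, so such users would not appear in finaldf.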
The next step is to define a udf function as
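import scala.collection.mutable
import org.apache.spark.sql.functions.udf
def contains = udf((list: mutable.WrappedArray[String]) => {
  // Replace ??? with your own list of ids to keep, for example: Seq("Default", "BestValue")
  val validModules: Seq[String] = ???
  list.filter(validModules.contains(_)) // keep only the ids present in validModules
})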
Then just call the udf function as
finaldf.withColumn("modules", contains($"modules")).show(false)
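Since the question asks for a method that takes the list of module ids as a parameter, the udf above could be wrapped roughly as follows. This is only a sketch: the names ModuleFilter and keepModules are hypothetical, and it assumes modules has already been converted to an array of strings, as in finaldf.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object ModuleFilter {
  // Hypothetical helper: returns df with modules reduced to the ids in validModules.
  // Assumes modules is a non-null array<string> column.
  def keepModules(df: DataFrame, validModules: Seq[String]): DataFrame = {
    // A Set gives constant-time membership checks inside the udf.
    val allowed = validModules.toSet
    val keep = udf((ids: Seq[String]) => ids.filter(allowed.contains))
    df.withColumn("modules", keep(col("modules")))
  }
}
```

It would then be called as ModuleFilter.keepModules(finaldf, Seq("Default", "BestValue")).show(false). Building the udf inside the method lets each call capture its own list, so nothing has to be hard-coded.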
That should be it. I hope the answer is helpful.