Filtering out nested array entries in a DataFrame (JSON) by passing in a list of values to match against


Problem description

I read a huge file into a DataFrame; each line of the file holds a JSON object like this:

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

I would like to pass a list of module ids to a method and have it clear out every item that is not part of that list. In other words, it should remove all modules whose id does not equal any of the values in the passed-in list.

Do you have a solution?

Recommended answer

The JSON shows that modules is a collection of struct[String]: an array of structs, each holding a single string field id. For the current requirement you will have to convert the Array[struct[String]] to Array[String]:

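import org.apache.spark.sql.functions.{collect_list, explode}
// $"..." needs spark.implicits._ in scope (spark being your SparkSession)
val finaldf = df.withColumn("modules", explode($"modules.id"))
                  .groupBy("userId", "vars").agg(collect_list("modules").as("modules"))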
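What this conversion does to a single row can be pictured on plain Scala collections. The following is only an illustrative sketch without Spark; `Module` and `moduleIds` are hypothetical names that do not appear in the question or the answer.

```scala
// Hypothetical stand-in for one element of the "modules" array.
final case class Module(id: String)

// Array[struct[String]] -> Array[String]: keep only the id of each module.
def moduleIds(modules: Seq[Module]): Seq[String] = modules.map(_.id)

// Three of the modules from the sample JSON object.
val row = Seq(Module("New"), Module("Default"), Module("BestValue"))
val ids = moduleIds(row)
```

After this step each row carries a flat list of id strings, which is the shape the filtering step works on.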

The next step is to define a udf function:

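import org.apache.spark.sql.functions.udf
import scala.collection.mutable

def contains = udf((list: mutable.WrappedArray[String]) => {
  val validModules = ??? // your array definition here, for example: Array("Default", "BestValue")
  // keep only the ids that appear in validModules
  list.filter(validModules.contains(_))
})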

Then just call the udf function:

finaldf.withColumn("modules", contains($"modules")).show(false)
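The question asks for a method that receives the list of ids, whereas the udf above defines the list inside its body. One possible way to parameterise it is sketched here. `filterIds` and `keepModules` are hypothetical names, and the sketch assumes Spark SQL is on the classpath and that the array column reaches the udf as a `Seq[String]`.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Pure helper (hypothetical name): keep only the ids contained in `valid`.
// A null array (a row without modules) is passed through as null.
def filterIds(valid: Set[String])(ids: Seq[String]): Seq[String] =
  if (ids == null) null else ids.filter(valid.contains)

// Hypothetical factory: builds a udf that closes over the passed-in list.
def keepModules(validIds: Seq[String]): UserDefinedFunction = {
  val valid = validIds.toSet // set lookup instead of scanning the list per element
  udf((ids: Seq[String]) => filterIds(valid)(ids))
}

// Usage, assuming finaldf from the step above and spark.implicits._ in scope:
// finaldf.withColumn("modules", keepModules(Seq("Default", "BestValue"))($"modules")).show(false)
```

Keeping the filtering logic in a plain function makes it easy to check without a Spark session, while the factory only wraps it for use on a column.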

That should be it. I hope the answer is helpful.
