Deleting nested array entries in a DataFrame (JSON) on a condition

Problem description

I read a huge file into a DataFrame. Each line of the file holds a JSON object like the following:

{
  "userId": "12345",
  "vars": {
    "test_group": "group1",
    "brand": "xband"
  },
  "modules": [
    {
      "id": "New"
    },
    {
      "id": "Default"
    },
    {
      "id": "BestValue"
    },
    {
      "id": "Rating"
    },
    {
      "id": "DeliveryMin"
    },
    {
      "id": "Distance"
    }
  ]
}

How can I manipulate the DataFrame so that only the module with id = "Default" is kept? In other words, how do I delete every other module whose id does not equal "Default"?

Recommended answer

As you said, each line of the file holds a JSON object in the format given in the question:

{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}

If that's true, you can use sqlContext's json API to read the JSON file into a dataframe as below

val df = sqlContext.read.json("path to json file")
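On Spark 2.x and later, the same read is usually done through a SparkSession rather than sqlContext. The sketch below assumes Spark 2.2+ on the classpath and a local master. The object name `ReadJsonLines`, the helper `load`, and the shortened in-memory sample line are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonLines {
  // Parses one JSON object per string, mirroring a file with one object per line.
  // Assumption: Spark 2.2+, where DataFrameReader.json accepts a Dataset[String].
  def load(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.json(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-json-lines")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Shortened sample record in the shape shown in the question.
    val sample = Seq(
      """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"}]}"""
    )

    val df = load(spark, sample)
    df.printSchema() // modules is inferred as array<struct<id:string>>
    df.show(false)

    spark.stop()
  }
}
```

For a real file, `spark.read.json("path to json file")` plays the same role as the sqlContext call above.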

which should give you a dataframe like

+--------------------------------------------------------------------+------+--------------+
|modules                                                             |userId|vars          |
+--------------------------------------------------------------------+------+--------------+
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
+--------------------------------------------------------------------+------+--------------+

with the schema

root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- vars: struct (nullable = true)
 |    |-- brand: string (nullable = true)
 |    |-- test_group: string (nullable = true)

The final step is to explode modules.id and keep only the rows whose value is Default

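import sqlContext.implicits._                    // enables the $"column" syntax
import org.apache.spark.sql.functions.explode    // one output row per array element
val finaldf = df.withColumn("modules", explode($"modules.id"))
    .filter($"modules" === "Default")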

which should give you

+-------+------+--------------+
|modules|userId|vars          |
+-------+------+--------------+
|Default|12345 |[xband,group1]|
|Default|12345 |[xband,group1]|
+-------+------+--------------+
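One consequence of explode followed by filter is that a row with no matching module disappears from the result entirely. The plain-Scala sketch below models that behaviour with ordinary collections, without Spark. The case classes, the object name, and the second user "67890" are hypothetical and exist only to illustrate the point.

```scala
// Hypothetical model types, for illustration only.
final case class Module(id: String)
final case class User(userId: String, modules: Seq[Module])

object ExplodeFilterModel {
  // explode: one (userId, moduleId) pair per array element;
  // filter: keep only the pairs whose module id matches.
  def explodeThenFilter(users: Seq[User], wanted: String): Seq[(String, String)] =
    users
      .flatMap(u => u.modules.map(m => (u.userId, m.id)))
      .filter { case (_, id) => id == wanted }

  def main(args: Array[String]): Unit = {
    val users = Seq(
      User("12345", Seq(Module("New"), Module("Default"), Module("Rating"))),
      User("67890", Seq(Module("New"))) // hypothetical user without a Default module
    )
    // Only user 12345 survives; user 67890 produces no row at all.
    explodeThenFilter(users, "Default").foreach(println)
  }
}
```

If every user must stay in the output even without a Default module, this drop is worth keeping in mind.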

I hope the answer is helpful.

Updated

The code above produces JSON as follows

{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

But if your requirement is to get the following instead

{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}

then you should explode modules rather than modules.id

val finaldf = df.withColumn("modules", explode($"modules"))
    .filter($"modules.id" === "Default")
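If modules should remain an array that contains only the matching entries, which is closer to what the question literally asks, one alternative is the `filter` higher-order SQL function. This is a different technique from the explode approach in the answer, and it assumes Spark 2.4 or later. The names `KeepDefaultModules` and `keepDefault` and the shortened sample line are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object KeepDefaultModules {
  // Keeps `modules` as an array and retains only the elements whose id is "Default".
  // Assumption: Spark 2.4+, where the SQL function filter(array, lambda) is available.
  def keepDefault(df: DataFrame): DataFrame =
    df.withColumn("modules", expr("filter(modules, m -> m.id = 'Default')"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("keep-default-modules")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Shortened sample record in the shape shown in the question.
    val lines = Seq(
      """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"Rating"}]}"""
    ).toDS()

    val result = keepDefault(spark.read.json(lines))
    result.toJSON.show(false) // modules stays a JSON array of structs

    spark.stop()
  }
}
```

Unlike explode plus filter, this keeps one output row per input row; a row with no Default module ends up with an empty modules array instead of being dropped.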
