Deleting nested array entries in a DataFrame (JSON) on a condition
Question
I read a huge file into a DataFrame. Each line of the file holds a JSON object like the following:
{
"userId": "12345",
"vars": {
"test_group": "group1",
"brand": "xband"
},
"modules": [
{
"id": "New"
},
{
"id": "Default"
},
{
"id": "BestValue"
},
{
"id": "Rating"
},
{
"id": "DeliveryMin"
},
{
"id": "Distance"
}
]
}
How can I manipulate the DataFrame so that it keeps only the module with id = "Default"? In other words, how do I delete every other module whose id does not equal "Default"?
Recommended answer
As you said, each line of the file holds JSON in the format given in the question:
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"BestValue"},{"id":"Rating"},{"id":"DeliveryMin"},{"id":"Distance"}]}
If that's true, you can use sqlContext's json API to read the JSON file into a DataFrame as below:
val df = sqlContext.read.json("path to json file")
which should give you the DataFrame as
+--------------------------------------------------------------------+------+--------------+
|modules |userId|vars |
+--------------------------------------------------------------------+------+--------------+
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
|[[New], [Default], [BestValue], [Rating], [DeliveryMin], [Distance]]|12345 |[xband,group1]|
+--------------------------------------------------------------------+------+--------------+
and the schema as
root
 |-- modules: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- vars: struct (nullable = true)
 |    |-- brand: string (nullable = true)
 |    |-- test_group: string (nullable = true)
The final step is to explode modules.id and keep only the rows whose value is Default:
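import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // enables the $"..." column syntax
val finaldf = df.withColumn("modules", explode($"modules.id"))
  .filter($"modules" === "Default")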
which should give you
+-------+------+--------------+
|modules|userId|vars |
+-------+------+--------------+
|Default|12345 |[xband,group1]|
|Default|12345 |[xband,group1]|
+-------+------+--------------+
I hope the answer is helpful.
Updated
The code above would produce JSON as
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":"Default","userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
But if your requirement is to get output as below
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
{"modules":{"id":"Default"},"userId":"12345","vars":{"brand":"xband","test_group":"group1"}}
then you should explode modules rather than modules.id:
val finaldf = df.withColumn("modules", explode($"modules"))
.filter($"modules.id" === "Default")
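Both explode variants turn modules into a single value, and they drop any row that has no Default module. If modules should stay an array, one alternative (not part of the original answer) is Spark SQL's higher-order filter function. The sketch below assumes Spark 2.4 or later and a SparkSession entry point; the second sample line, the user ID 67890 and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: Spark 2.4+, where the SQL higher-order function `filter` is available.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-modules-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample input: one JSON object per line. The second line is hypothetical
// and deliberately contains no "Default" module.
val lines = Seq(
  """{"userId":"12345","vars":{"test_group":"group1","brand":"xband"},"modules":[{"id":"New"},{"id":"Default"},{"id":"Rating"}]}""",
  """{"userId":"67890","vars":{"test_group":"group2","brand":"xband"},"modules":[{"id":"New"}]}"""
)
val df = spark.read.json(lines.toDS())

// Keep only the array elements whose id is "Default".
// The column remains an array of structs, and no rows are removed.
val filtered = df.withColumn("modules", expr("filter(modules, m -> m.id = 'Default')"))

// Print each resulting row as a JSON string.
filtered.toJSON.collect().foreach(println)
```

With this approach the first row keeps a one-element modules array, and the second row is retained with an empty array rather than being dropped. To remove such rows as well, a further .filter(expr("size(modules) > 0")) could be chained on.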