spark dropDuplicates based on json array field


Question

I have json files of the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}
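
A minimal sketch of loading such files into a DataFrame, assuming a hypothetical input path and a local SparkSession; the multiLine option is needed because each JSON object spans several lines:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dedup-example") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical input path; multiLine lets Spark parse pretty-printed
// objects that span more than one line.
val df = spark.read
  .option("multiLine", true)
  .json("/path/to/json/files/*.json")

df.printSchema()
// root
//  |-- names: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- lastName: string (nullable = true)
//  |    |    |-- name: string (nullable = true)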

I want to read several such json files and distinct them based on the "name" column inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.

Answer

This seems to be a regression introduced in Spark 2.0. If you bring the nested column to the top level, you can drop the duplicates: create a new key column from the column(s) you want to dedup on, drop duplicates based on that key, and finally drop the helper column. The following approach works for composite keys as well.

import org.apache.spark.sql.functions.{col, concat_ws}

// Build one top-level key from the nested column(s), dedup on it, drop it.
val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
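
If the goal is instead to keep each distinct person across files, rather than deduplicating whole rows by their full name list, a sketch of an alternative (my assumption about the intent, not part of the original answer) is to explode the array so the nested field becomes a top-level column:

import org.apache.spark.sql.functions.{col, explode}

// One row per person: flatten the array of structs, promote the nested
// fields to top-level columns, then dedup on the plain "name" column.
val distinctNames = df
  .select(explode(col("names")).as("person"))
  .select(col("person.name").as("name"), col("person.lastName").as("lastName"))
  .dropDuplicates("name")

distinctNames.show(false)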
