Map key, value pair based on similarity of their value in Spark


Question

I have been learning Spark for several weeks, and I am currently trying to group items or people based on their connections, using Spark and Hadoop in Scala. For example, I want to see how football players are connected based on their club history. My "players" RDD would be:

(John, FC Sion)
(Mike, FC Sion)
(Bobby, PSV Eindhoven)
(Hans, FC Sion)

I want an RDD like this:

(John, <Mike, Hans>)
(Mike, <John, Hans>)
(Bobby, <>)
(Hans, <Mike, John>)

I plan to use map to accomplish this:

val splitClubs = players.map(player=> (player._1, parseTeammates(player._2, players)))

where parseTeammates is a function that finds the players who also play for the same club (player._2):

// RDD is not a type, how can I insert rdd into a function?
def parseTeammates(club: String, rdd: RDD) : List[String] = {
    // will generate a list of players that contains same "club" value
    val playerList = rdd.filter(_._1 == club)
    return playerList.values;
}

I get a compilation error (type mismatch): the function is expected to return List[String], but playerList.values returns org.apache.spark.rdd.RDD[List[String]]. Can anybody help me get the value of an RDD in its simple form (in my case, List[String])?

Also, I think there must be a more elegant way to solve this problem than creating a separate RDD, looking up a certain key in it, and returning the value as a list.

Recommended answer

I think your parseTeammates approach is a little off in the world of RDDs. When you are dealing with RDDs and potentially very large amounts of data, you don't want to do this kind of nested looping. Try instead to reorganize your data.
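As an aside on the compile errors in the question, here is a minimal sketch of how the helper could be typed, assuming Spark's Scala API is on the classpath. The object name `ParseTeammatesSketch` and the local master setting are illustrative and not part of the original post. `RDD` needs its element type, and `collect()` is the action that brings the values back to the driver as a local collection.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object, used only to make the sketch self-contained.
object ParseTeammatesSketch {

  // RDD takes a type parameter: here the elements are (player, club) pairs.
  def parseTeammates(club: String, rdd: RDD[(String, String)]): List[String] =
    rdd
      .filter { case (_, c) => c == club } // compare against the club, not the player name
      .keys                                // keep only the player names
      .collect()                           // action: returns an Array[String] on the driver
      .toList

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for experimentation only.
    val conf = new SparkConf().setAppName("parse-teammates-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val players = sc.parallelize(Seq(
      ("John", "FC Sion"),
      ("Mike", "FC Sion"),
      ("Bobby", "PSV Eindhoven"),
      ("Hans", "FC Sion")
    ))
    println(parseTeammates("FC Sion", players))
    sc.stop()
  }
}
```

Two caveats apply. This version returns every player of the club, including the one being looked up. It also has to be called from the driver: Spark does not support invoking RDD operations from inside another RDD's transformation, so calling it within players.map would still fail at runtime, which is another reason to reorganize the data instead.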

The code below will get you what you want:

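// Key each player by club, then merge into one list of players per club.
players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)
  // Pair each player with everyone else in the same club's list.
  .flatMap { case (_, list) =>
    list.zipWithIndex.map { case (player, index) => (player, list.take(index) ++ list.drop(index + 1)) } }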

Note that I first organize the data by the club each player played for, and then combine the players to produce the result in the format you are looking for.
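To check the logic without a cluster, the same two steps can be sketched on a plain Scala collection. This is only an illustration using the sample data from the question: `TeammatesSketch`, `byClub`, and `teammates` are hypothetical names, and `groupBy` stands in for the RDD's `reduceByKey`.

```scala
// Hypothetical wrapper object; a local analogue of the RDD pipeline.
object TeammatesSketch {

  // Sample data from the question, held in a local Seq instead of an RDD.
  val players: Seq[(String, String)] = Seq(
    ("John", "FC Sion"),
    ("Mike", "FC Sion"),
    ("Bobby", "PSV Eindhoven"),
    ("Hans", "FC Sion")
  )

  // Step 1: collect the player names of each club into one list.
  val byClub: Map[String, List[String]] =
    players.groupBy(_._2).map { case (club, pairs) => (club, pairs.map(_._1).toList) }

  // Step 2: for each player, keep every other member of the same club's list.
  val teammates: Map[String, List[String]] =
    byClub.values.flatMap { list =>
      list.zipWithIndex.map { case (player, index) =>
        (player, list.take(index) ++ list.drop(index + 1))
      }
    }.toMap

  def main(args: Array[String]): Unit =
    teammates.foreach { case (player, others) => println(s"$player -> $others") }
}
```

A player whose club has no other members, such as Bobby, ends up with an empty list, which matches the `(Bobby, <>)` row in the desired output.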

I hope this helps.
