How to combine two RDD[String]s index-wise?


Problem description

I'm working with Spark RDDs and created two identical-length arrays: one holds the hour of each tweet, and the other holds the tweet's text. I'd like to combine these into one data structure (perhaps a tuple?) that I can filter by the hour and text of tweets, but after combining I'm struggling with how to do this.

scala> val split_time = split_date.map(line => line.split(":")).map(word =>
(word(0)))
split_time: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[28] at map 
at <console>:31

scala> split_time.take(10)
res8: Array[String] = Array(17, 17, 17, 17, 17, 17, 17, 17, 17, 17)


scala> val split_text = text.map(line => line.split(":")).map(word => 
(word(1).toLowerCase))
split_text: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at map at <console>:29

scala> split_text.take(10)
res0: Array[String] = Array("add @joemontana to this pic and you've got 
something #nfl https, "are you looking for photo editor, "#colts frank gore 
needs 27 rushing yards to pass jerome bettis and 49 yards to pass ladainian 
tomlinson to move int… https, "rt @nflstreamfree,.....

// combine into a tuple
val tweet_tuple = (split_time, split_text)

For example, I want to get all tweets for hour 17 that mention the word "colts":

tweet_tuple.filter(tup => tup._1 == 17 && tup._2.toString.matches("colts"))

<console>:40: error: value filter is not a member of (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String])
          tweet_tuple.map(line => line._1 == 17 && line._2.toString.matches("colts"))

Recommended answer

You should use .zip to combine both RDDs into an RDD[(String, String)].

For example, I created two RDDs:

val split_time = sparkContext.parallelize(Array("17", "17", "17", "17", "17", "17", "17", "17", "17", "17"))
val split_text = sparkContext.parallelize(Array("17", "17", "17", "17", "colts", "17", "17", "colts", "17", "17"))

zip combines the two RDDs above into an RDD[Tuple2[String, String]]:

val tweet_tuple = split_time.zip(split_text)
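As a side note, RDD.zip only works when both RDDs have the same number of partitions and the same number of elements per partition; otherwise Spark fails at runtime. The index-wise pairing itself behaves like zip on plain Scala collections, which this minimal sketch shows without needing a SparkContext (the zipWithIndex/join alternative in the comments is an assumption for the unequal-partitioning case, not part of the original answer):

```scala
// Plain-Scala analogue of RDD.zip: pairs the i-th element of one
// sequence with the i-th element of the other, index-wise.
val times = Seq("17", "17", "18")
val texts = Seq("colts win", "photo editor", "colts lose")

val paired: Seq[(String, String)] = times.zip(texts)
// paired == Seq(("17", "colts win"), ("17", "photo editor"), ("18", "colts lose"))

// If the equal-elements-per-partition guarantee cannot be met, a
// hypothetical safer route on real RDDs is zipWithIndex + join:
//   val byIdx1 = split_time.zipWithIndex.map(_.swap)
//   val byIdx2 = split_text.zipWithIndex.map(_.swap)
//   val joined = byIdx1.join(byIdx2).values   // RDD[(String, String)]
```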

After combining, all you need to do is apply .filter:

tweet_tuple.filter(line => line._1 == "17" && line._2.toString.matches("colts"))

The output should be:

(17,colts)
(17,colts)

Updated

Since your split_text RDD is a collection of sentences, contains should be used instead of matches. So the following logic should work after you've zipped:

tweet_tuple.filter(line => line._1 == "17" && line._2.toString.contains("colts"))
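The difference matters because String.matches tests the regex against the entire string, while contains checks for a substring. A quick illustration in plain Scala:

```scala
// matches anchors the pattern to the whole string, so a bare word
// fails against a full sentence; contains looks for a substring.
val sentence = "#colts frank gore needs 27 rushing yards"

val fullMatch = sentence.matches("colts")      // false: regex must cover the entire string
val substring = sentence.contains("colts")     // true: "colts" appears inside the sentence
val regexWork = sentence.matches(".*colts.*")  // true: wildcards let the regex span the string
```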

