Scala Spark: Split collection into several RDDs?


Question

Is there any Spark function that can split a collection into several RDDs according to some criteria? Such a function would allow avoiding excessive iteration. For example:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val logFile = "file.txt"
  val conf = new SparkConf().setAppName("Simple Application")
  val sc = new SparkContext(conf)
  val logData = sc.textFile(logFile, 2).cache()
  // Each filter below triggers its own pass over logData.
  logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
  logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}

In this example I have to iterate over `logData` twice just to write the results to two separate files:

    logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
    logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")

It would be nice instead to have something like this:

    val resultMap = logData.map { line =>
      if (line.contains("a")) ("a", line)
      else if (line.contains("b")) ("b", line)
      else ("-", line)
    }
    resultMap.writeByKey("a", "linesA.txt")  // writeByKey is a hypothetical API
    resultMap.writeByKey("b", "linesB.txt")

Is there anything like this?

Answer

Take a look at the following question:

Write to multiple outputs by key Spark - one Spark job: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
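The approach discussed in that question writes each key to its own file in a single job by using Hadoop's MultipleTextOutputFormat. A minimal sketch of that idea, adapted to this example (the KeyBasedOutput class name and the "output" path are illustrative, not part of the linked answer):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each (key, value) pair to a file path derived from its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // e.g. output/a/part-00000, output/b/part-00000
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name

  // Suppress the key so only the line itself appears in the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

val keyed = logData.map { line =>
  if (line.contains("a")) ("a", line)
  else if (line.contains("b")) ("b", line)
  else ("-", line)
}
keyed.saveAsHadoopFile("output", classOf[String], classOf[String],
  classOf[KeyBasedOutput])

This scans logData only once and writes one subdirectory per key under the output directory.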

You can flatMap an RDD with a function like the following and then do a groupBy on the key.

// Emit a (word, line) pair for every word the line contains.
def multiFilter(words: List[String], line: String) =
  for { word <- words; if line.contains(word) } yield (word, line)

val filterWords = List("a", "b")
val filteredRDD = logData.flatMap(line => multiFilter(filterWords, line))
// Group all lines that matched the same word under one key.
val groupedRDD = filteredRDD.groupBy(_._1)
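To actually write each group to its own file, you still end up looping over the keys afterwards, for example (a hypothetical continuation; the lines_ output paths are illustrative):

// groupedRDD has type RDD[(String, Iterable[(String, String)])]
for (word <- filterWords) {
  groupedRDD
    .filter { case (key, _) => key == word }
    .flatMap { case (_, pairs) => pairs.map(_._2) }
    .saveAsTextFile(s"lines_$word")
}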

However, depending on the size of your input RDD, you may not see any performance gain, because any groupBy operation involves a shuffle.

On the other hand, if you have enough memory in your Spark cluster, you can cache the input RDD, and running multiple filter operations may therefore not be as expensive as you think.
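In that case the straightforward version from the question, with the cache made explicit, may be hard to beat; a minimal sketch:

// Cache once; the first action materializes the cached partitions,
// and the second filter then reads from memory instead of re-reading the file.
val cached = logData.cache()
cached.filter(_.contains("a")).saveAsTextFile("linesA.txt")
cached.filter(_.contains("b")).saveAsTextFile("linesB.txt")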

