Write to multiple outputs by key Spark - one Spark job


Problem description

How can you write to multiple outputs dependent on the key using Spark in a single job? I could of course use .filter for all the possible keys, but that is a horrible hack, which will fire up many jobs and require multiple iterations over the data set - which is a hack even if it's cached.
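For context, a minimal sketch (not from the question itself, and assuming an existing SparkContext sc) of the filter-per-key approach being rejected: one filter plus one save per key, so Spark launches a separate job and re-scans the data for every key.

// Sketch of the rejected approach: one filter + one saveAsTextFile per key.
// Every key triggers its own job and its own pass over the (cached) RDD.
val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))).cache()
val keys = rdd.keys.distinct().collect()   // the keys must be collected up front
keys.foreach { k =>
  rdd.filter { case (key, _) => key == k }
     .values
     .saveAsTextFile(s"prefix/$k")         // e.g. prefix/1, prefix/2
}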

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce job (http://stackoverflow.com/questions/23994383/write-to-multiple-outputs-by-key-scalding-hadoop-one-mapreduce-job/)

For example,

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
.writeAsMultiple(prefix, compressionCodecOption)

would ensure cat prefix/1 is:

a
b

and cat prefix/2 is:

c

I know that the answer to this question will involve using MultipleOutputFormat in Hadoop.

UPDATE: Please do not place any limitations on the resulting output, e.g. do not supply a solution where the number of files is fixed, where the number of keys must be known a priori, or where the compression type is limited.

UPDATE: In Scalding this is now super easy thanks to TemplatedTsv. I want an answer just like that!

Recommended answer

I would do it like this, which is scalable:

import org.apache.hadoop.io.NullWritable

import org.apache.spark._
import org.apache.spark.SparkContext._

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Return NullWritable so the key itself is not written into the output file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // Use the key as the file name, so every key gets its own output file.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
    .map(a => (k, v)) // Your own implementation: derive a (key, value) pair from each line
    .partitionBy(new HashPartitioner(num)) // num = the number of partitions you want
    .saveAsHadoopFile("output/path", classOf[String], classOf[String],
      classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}

I just saw a similar answer above, but actually we don't need customized partitions. MultipleTextOutputFormat will create a file for each key. It is OK that multiple records with the same key fall into the same partition.
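As a side note, if you want each key to become a sub-directory containing normal part files (rather than a single file named after the key), a possible variant of the output format - not part of the original answer, and reusing the imports above - could look like this:

// Hypothetical variant: write each key into its own sub-directory,
// matching a layout like prefix/1/part-00000, prefix/2/part-00001, ...
class KeyBasedDirOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the written record; only the value ends up in the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // "<key>/<part-name>" puts every key in its own directory with regular part files inside.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}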

Use new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num to a large value. That way, each partition will not open too many HDFS file handlers.
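Putting the pieces together with the question's sample data, a minimal end-to-end sketch (it assumes the imports and the RDDMultipleTextOutputFormat class from the answer above are in scope; the app name and the output path "prefix" are illustrative):

// End-to-end sketch using the question's sample data.
// Keys are mapped to String because RDDMultipleTextOutputFormat casts the key to String.
val conf = new SparkConf().setAppName("SplitExample")
val sc = new SparkContext(conf)

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
  .map { case (k, v) => (k.toString, v) }
  .partitionBy(new HashPartitioner(2))   // all records with the same key land in one partition
  .saveAsHadoopFile("prefix", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])
// Afterwards: cat prefix/1  ->  a, b
//             cat prefix/2  ->  c

sc.stop()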
