Spark: Process multiline input blob


Problem description

I'm new to Hadoop/Spark and I'm trying to process a multi-line input blob into a CSV or tab-delimited format for further processing.

Sample input

------------------------------------------------------------------------
AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA2
BBB=someValueBBB2
CCC=someValueCCC2
DDD=someValueDDD2
EEE=someValueEEE2
FFF=someValueFFF2
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA3
BBB=someValueBBB3
CCC=someValueCCC3
DDD=someValueDDD3
EEE=someValueEEE3
FFF=someValueFFF3
GGG=someValueGGG3
HHH=someValueHHH3
ENDOFRECORD
------------------------------------------------------------------------

Desired output

someValueAAA1, someValueBBB1, someValueCCC1, someValueDDD1, someValueEEE1, someValueFFF1
someValueAAA2, someValueBBB2, someValueCCC2, someValueDDD2, someValueEEE2, someValueFFF2
someValueAAA3, someValueBBB3, someValueCCC3, someValueDDD3, someValueEEE3, someValueFFF3

Code I've tried so far:

// inputRDD
val inputRDD = sc.textFile("/somePath/someFile.gz")

// transform
val singleRDD = inputRDD.map(x=>x.split("ENDOFRECORD")).filter(x=>x.trim.startsWith("AAA"))


val logData = singleRDD.map(x=>{
  val rowData = x.split("\n")

  var AAA = ""
  var BBB = ""
  var CCC = ""
  var DDD = ""
  var EEE = ""
  var FFF = ""

  for (data <- rowData){
    if(data.trim().startsWith("AAA")){
      AAA = data.split("AAA=")(1)
    }else if(data.trim().startsWith("BBB")){
      BBB = data.split("BBB=")(1)
    }else if(data.trim().startsWith("CCC=")){
      CCC = data.split("CCC=")(1)
    }else if(data.trim().startsWith("DDD=")){
      DDD = data.split("DDD=")(1)
    }else if(data.trim().startsWith("EEE=")){
      EEE = data.split("EEE=")(1)
    }else if(data.trim().startsWith("FFF=")){
      FFF = data.split("FFF=")(1)
    }
  }
  (AAA,BBB,CCC,DDD,EEE,FFF)
})

logData.take(10).foreach(println)

This does not seem to work, and I get output such as

AAA,,,,,,
,BBB,,,,,
,,CCC,,,,
,,,DDD,,,

I can't figure out what's wrong here. Do I have to write a custom input format to solve this?
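As an editorial aside (not part of the original post): a custom input format is not strictly necessary, because Hadoop's `TextInputFormat` honors the `textinputformat.record.delimiter` configuration key. A sketch, assuming a `SparkContext` named `sc` is in scope and the input path is the one from the question:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Tell TextInputFormat to break records on ENDOFRECORD instead of "\n"
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "ENDOFRECORD")

// Each element of `records` is now the full multi-line text of one record,
// which can then be split on "\n" and parsed field by field.
val records = sc.newAPIHadoopFile(
    "/somePath/someFile.gz",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }
```

This keeps the input line-splittable per record even for files too large to read whole, which `wholeTextFiles` (used in the answer below) does not.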

Recommended answer

Your `sc.textFile` call reads the input one line at a time, so no single RDD element ever contains a whole record; splitting an individual line on `ENDOFRECORD` can never reassemble the fields, which is why most of your columns come out empty. To process the data as records:

  1. Load the dataset with `wholeTextFiles`, which gives you the data as (filename, content) key/value pairs.
  2. `flatMap` over the value, splitting on `ENDOFRECORD`, to obtain one chunk of text per record. For example:

AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1

  3. Split each such chunk on \n to recover the individual key=value lines.

Try the following code:

// load your data set as (filename, content) pairs
val data = sc.wholeTextFiles("file:///path/to/file")

// one element per record
val data1 = data.flatMap(x => x._2.split("ENDOFRECORD"))

val logData = data1.map(x => {
  // one element per line within the record
  val rowData = x.split("\n")

  var AAA = ""
  var BBB = ""
  var CCC = ""
  var DDD = ""
  var EEE = ""
  var FFF = ""

  // pick out each field by its key; separator lines match nothing and are skipped
  for (data <- rowData) {
    if (data.trim().contains("AAA=")) {
      AAA = data.split("AAA=")(1)
    } else if (data.trim().contains("BBB=")) {
      BBB = data.split("BBB=")(1)
    } else if (data.trim().contains("CCC=")) {
      CCC = data.split("CCC=")(1)
    } else if (data.trim().contains("DDD=")) {
      DDD = data.split("DDD=")(1)
    } else if (data.trim().contains("EEE=")) {
      EEE = data.split("EEE=")(1)
    } else if (data.trim().contains("FFF=")) {
      FFF = data.split("FFF=")(1)
    }
  }
  (AAA, BBB, CCC, DDD, EEE, FFF)
})

logData.foreach(println)

Output:

(someValueAAA1,someValueBBB1,someValueCCC1,someValueDDD1,someValueEEE1,someValueFFF1)
(someValueAAA2,someValueBBB2,someValueCCC2,someValueDDD2,someValueEEE2,someValueFFF2)
(someValueAAA3,someValueBBB3,someValueCCC3,someValueDDD3,someValueEEE3,someValueFFF3)
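The question asked for comma-separated lines rather than tuples; each tuple can be joined with `mkString` via its `productIterator`. A minimal sketch on one record, using a sample value from the question's data:

```scala
// One parsed record, as produced by the map above
val row = ("someValueAAA1", "someValueBBB1", "someValueCCC1",
           "someValueDDD1", "someValueEEE1", "someValueFFF1")

// productIterator walks the tuple's elements; mkString joins them with ", "
val csvLine = row.productIterator.mkString(", ")
println(csvLine)
// someValueAAA1, someValueBBB1, someValueCCC1, someValueDDD1, someValueEEE1, someValueFFF1
```

On the RDD itself this would be `logData.map(_.productIterator.mkString(", ")).saveAsTextFile("/somePath/output")` (output path hypothetical).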
