Parsing multiline records in Scala


Problem description

Here is my RDD[String]

M1 module1
PIP a Z A
PIP b Z B
PIP c Y n4

M2 module2
PIP a I n4
PIP b O D
PIP c O n5

and so on. Basically, I need an RDD of keys (the second word on line 1) and values consisting of the subsequent PIP lines, which can then be iterated over.

I have tried the following:

val usgPairRDD = usgRDD.map(x => (x.split("\\n")(0), x))

But this gives me the following output:

(,)
(M1 module1,M1 module1)
(PIP a Z A,PIP a Z A)
(PIP b Z B,PIP b Z B)
(PIP c Y n4,PIP c Y n4)
(,)
(M2 module2,M2 module2)
(PIP a I n4,PIP a I n4)
(PIP b O D,PIP b O D)
(PIP c O n5,PIP c O n5)

Instead, I want the output to be:

module1, (PIP a Z A, PIP b Z B, PIP c Y n4)
module2, (PIP a I n4,PIP b O D, PIP c O n5)

What am I doing wrong? I am quite new to Spark APIs. Thanks

Hi @zero323,

usgRDD.take(10).foreach(x => println(x + "%%%%%%%%%"))

yields ...

%%%%%%%%%
M1 module1%%%%%%%%%
PIP a Z A%%%%%%%%%
PIP b Z B%%%%%%%%%
PIP c Y n4%%%%%%%%%
%%%%%%%%%
M2 module2%%%%%%%%%
PIP a I n4%%%%%%%%%
PIP b O D%%%%%%%%%
PIP c O n5%%%%%%%%%

and so on.

Hi @zero323 and @Daniel Darabos, my input is a very large set of many files (spanning TBs). Here is a sample:

BIN n4
BIN n5
BIN D
BIN E
PIT A I A
PIT B I B 
PIT C I C
PIT D O D
PIT E O E
DEF M1 module1
   PIP a Z A
   PIP b Z B
   PIP c Y n4
DEF M2 module2
   PIP a I n4
   PIP b O D
   PIP c O n5

I need all the BIN, PIT and DEF records (including the PIP lines below them) in 3 different RDDs. Here is how I am doing this currently (from the discussion, I sense usgRDD below is wrongly computed):

val binRDD = levelfileRDD.filter(line => line.contains("BIN"))
val pitRDD = levelfileRDD.filter(line => line.contains("PIT"))
val usgRDD = levelfileRDD.filter(line => !line.contains("BIN") && !line.contains("PIT")).flatMap(s=>s.split("DEF").map(_.trim))

I need 3 types of RDDs (at the moment) because I need to perform validation later on. For example, "n4" under "DEF M2 module2" can only exist if n4 is a BIN element (see the sketch after the sample output below). From the RDDs, I hope to derive relationships using the GraphX APIs (I have obviously not come up to this point yet). It would be ideal if each usgPairRDD (computed from usgRDD or otherwise) printed the following:

module1, (a Z A, b Z B, c Y n4) %%%%%%%
module2, (a I n4, b O D, c O n5) %%%%%%%
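
As a hedged illustration of that validation (not part of the original post), the sketch below assumes the binRDD and pitRDD defined above, that the element name is the second token of a BIN or PIT line, that the last token of a PIP line is the element it references, and that usgRDD holds one complete DEF record per element (as produced in the answer below):

// Sketch only: collect the known BIN/PIT element names, broadcast them,
// and flag PIP lines that reference a name outside that set.
val knownNames: Set[String] =
  (binRDD union pitRDD).map(_.trim.split("\\s+")(1)).collect().toSet
val knownNamesBc = sc.broadcast(knownNames)

// PIP lines whose referenced element is neither a BIN nor a PIT element
val danglingRefs = usgRDD
  .flatMap(_.split("\n").map(_.trim).filter(_.startsWith("PIP")))
  .filter(line => !knownNamesBc.value.contains(line.split("\\s+").last))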

I hope I am making sense. Apologies to the Spark Gods, if I am not.

Answer

By default, Spark creates a single element per line. That means that in your case every record is spread over multiple elements which, as Daniel Darabos pointed out in the comments, can be processed by different workers.

Since it looks like your data is relatively regular and records are separated by an empty line, you should be able to use newAPIHadoopFile with a custom record delimiter:

import org.apache.spark.rdd.RDD
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val path: String = ???

// Make the input format treat a blank line (two consecutive newlines)
// as the record delimiter instead of a single newline.
val conf = new org.apache.hadoop.mapreduce.Job().getConfiguration
conf.set("textinputformat.record.delimiter", "\n\n")

// Each element of usgRDD is now one complete multi-line record.
val usgRDD = sc.newAPIHadoopFile(
    path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map{ case (_, v) => v.toString }

// Split every record into its header line (the key) and the remaining lines (the values).
val usgPairRDD: RDD[(String, Seq[String])] = usgRDD.map(_.split("\n") match {
  case Array(x, xs @ _*) => (x, xs)
})
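
If the key should be just the module name (the second word of the header line), as asked in the question, one more map on top of usgRDD does it. This is only a sketch, assuming every non-empty record starts with an "<id> <moduleName>" header followed by its PIP lines; byModule is a hypothetical name:

// Key each record by the second word of its header line
val byModule: RDD[(String, Seq[String])] = usgRDD
  .map(_.trim)
  .filter(_.nonEmpty)                              // skip empty records from leading/trailing blank lines
  .map { record =>
    val lines = record.split("\n").map(_.trim)
    val moduleName = lines.head.split("\\s+")(1)   // e.g. "module1" from "M1 module1"
    (moduleName, lines.tail.toSeq)                 // remaining PIP lines become the values
  }

This yields pairs like (module1, Seq(PIP a Z A, PIP b Z B, PIP c Y n4)), matching the desired output above.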

In Spark 2.4 or later, the data loading part can also be done with the Dataset API:

import org.apache.spark.sql.Dataset

// "lineSep" sets the record separator; textFile returns a Dataset[String]
val ds: Dataset[String] = spark.read.option("lineSep", "\n\n").textFile(path)
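
The same pairing works on the Dataset as well; a minimal sketch, assuming the ds loaded above and an active SparkSession named spark (spark.implicits._ supplies the tuple encoder), with pairsDS a hypothetical name:

import spark.implicits._

// Split each multi-line record into (module name, PIP lines), as in the RDD version above
val pairsDS = ds
  .map(_.trim)
  .filter(_.nonEmpty)
  .map { record =>
    val lines = record.split("\n").map(_.trim)
    (lines.head.split("\\s+")(1), lines.tail.toSeq)
  }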
