XML processing in Spark


Problem description

Scenario: My input will be multiple small XMLs, and I am supposed to read these XMLs as RDDs, perform a join with another dataset to form an RDD, and send the output as XML.

Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?

Sample XML:

<root>
    <users>
        <user>
            <account>1234</account>
            <name>name_1</name>
            <number>34233</number>
        </user>
        <user>
            <account>58789</account>
            <name>name_2</name>
            <number>54697</number>
        </user>
    </users>
</root>

How will this be loaded into the RDD?

Recommended answer

Yes, it is possible, but the details will differ depending on the approach you take.

  • If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as an RDD[(String, String)] where the first element is the path and the second is the file content. Then you parse each file individually, just as you would in local mode (a sketch of this follows the list).
  • For larger files you can use Hadoop input formats.
    • If the structure is simple you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed (a second sketch follows the list).
    • Otherwise Mahout provides XmlInputFormat.
  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records that span partition boundaries. Conceptually this means something similar to creating a sliding window or partitioning records into groups of fixed size (a rough sketch follows the list):
    • use mapPartitionsWithIndex to identify records broken between partitions and collect the broken fragments
    • use a second mapPartitionsWithIndex to repair the broken records
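
Here is a minimal sketch of the wholeTextFiles approach from the first bullet, assuming each small input file looks like the sample document above and is parsed locally with Scala's built-in scala.xml library; the path small_xmls/*.xml is just a placeholder:

import scala.xml.XML

// wholeTextFiles yields (path, full file content) pairs, one per file,
// which is fine as long as every file fits comfortably in memory.
val files = sc.wholeTextFiles("small_xmls/*.xml")

// Parse each file on the executors and flatten out the <user> records
// as plain tuples that can later be joined with another dataset.
val users = files.flatMap { case (_, content) =>
  val doc = XML.loadString(content)
  (doc \\ "user").map { u =>
    ((u \ "account").text, (u \ "name").text, (u \ "number").text)
  }
}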
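
For the Hadoop-input-format route, a hedged sketch of splitting records with textinputformat.record.delimiter via newAPIHadoopFile; the path users.xml is a placeholder, and the exact behaviour of the delimiter setting should be checked against your Hadoop version. Mahout's XmlInputFormat would be plugged into the same newAPIHadoopFile call in place of TextInputFormat.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Treat everything up to a closing </user> tag as one record.
conf.set("textinputformat.record.delimiter", "</user>")

val records = sc
  .newAPIHadoopFile("users.xml", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }
  // Keep only chunks that actually contain a record; the delimiter itself
  // is stripped, so re-append </user> to get well-formed fragments.
  .filter(_.contains("<user>"))
  .map(chunk => chunk.substring(chunk.indexOf("<user>")) + "</user>")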
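
And a rough, purely illustrative sketch of the two-pass repair idea from the last bullet, assuming the file was read line by line with SparkContext.textFile and that every record starts on a line containing <user>; lines outside <user> records (such as the surrounding <root> and <users> tags) are simply dropped here:

val lines = sc.textFile("users.xml")   // placeholder path

// Pass 1: for every partition, collect the leading lines that appear before
// the first record start; they belong to a record begun in the previous partition.
val heads = lines
  .mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, iter.takeWhile(l => !l.contains("<user>")).toList))
  }
  .collect()
  .toMap
val headsBroadcast = sc.broadcast(heads)

// Pass 2: drop this partition's own leading fragment and append the fragment
// that spilled into the next partition, so boundary records are whole again.
val repaired = lines.mapPartitionsWithIndex { (idx, iter) =>
  val own = iter.dropWhile(l => !l.contains("<user>"))
  own ++ headsBroadcast.value.getOrElse(idx + 1, Nil).iterator
}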

Edit:

There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "foo")
  .load("bar.xml")
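
For the sample document in the question the row tag would presumably be user rather than foo, so each <user> element becomes one row with account, name and number columns, which can then be joined against the other dataset.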
      

