XML processing in Spark


Problem description

Scenario: my input will be multiple small XMLs, and I am supposed to read these XMLs as RDDs, perform a join with another dataset to form an RDD, and send the output as XML.

Is it possible to read XML using Spark and load the data as an RDD? If it is possible, how will the XML be read?

Sample XML:

    <root>
      <users>
        <user>
          <account>1234</account>
          <name>name_1</name>
          <number>34233</number>
        </user>
        <user>
          <account>58789</account>
          <name>name_2</name>
          <number>54697</number>
        </user>
      </users>
    </root>

How will this be loaded into an RDD?

Recommended answer

Yes, it is possible, but the details will differ depending on the approach you take.


  • If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as an RDD[(String, String)], where the first element is the path and the second is the file content. You can then parse each file individually, as in local mode (a sketch follows this list).

  • For larger files you can use Hadoop input formats.
    • If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input there is not XML, but it should give you an idea of how to proceed (also sketched after the list).
    • Otherwise, Mahout provides XmlInputFormat.

  • Finally, it is possible to read the file using SparkContext.textFile and adjust later for records spanning between partitions. Conceptually this means something similar to creating a sliding window or partitioning records into groups of fixed size:


    • use mapPartitionsWithIndex to identify records broken between partitions and collect the broken records
    • use a second mapPartitionsWithIndex to repair the broken records

      Edit:

      There is also the relatively new spark-xml package, which allows you to extract specific records by tag:

      val df = sqlContext.read
        .format("com.databricks.spark.xml")
        .option("rowTag", "foo")
        .load("bar.xml")
      
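      Newer versions of the same package can also write a DataFrame back out as XML, which covers the "send the output as an XML" part of the question. A hedged sketch; the DataFrame name, output path, and rootTag/rowTag values are assumptions about the desired layout:

      // joinedDf stands in for the DataFrame produced by your join.
      joinedDf.write
        .format("com.databricks.spark.xml")
        .option("rootTag", "users")   // outer wrapping element
        .option("rowTag", "user")     // one element per row
        .save("output")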

