Fastest way to parse flat, attribute-heavy XML in Java or Scala


Question

Suppose I have a big XML file like the one below. What would be the fastest way to parse it in Java or Scala? Streaming individual elements is important, but not absolutely essential.

All I'm interested in is getting the attribute values from each Result element.

<Response>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
</Response>

Recommended answer

Scala XML (probably slow and memory hungry)

cmbaxter's answer is technically correct, but it can be improved with the "flatMap that shit" pattern :-)

    // Note: scala.xml.pull was deprecated in later scala-xml releases
    import scala.io.Source
    import scala.xml.pull._

    // Make it a "def", because the Source is stateful and may be exhausted after it is read
    def xmlsrc = Source.fromString("""<Response>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    </Response>""")

    // Also a "def", because the result is an iterator that may be exhausted
    def xmlEr = new XMLEventReader(xmlsrc)

    // flatMap keeps the "outer shape" of the type it operates on,
    // so we are still dealing with an iterator
    def attrs = xmlEr.flatMap {
      // Every start tag contributes its (key, value) attribute pairs
      case e: EvElemStart => e.attrs.map(a => (a.key, a.value))
      // All other events (text, end tags, ...) contribute nothing
      case _ => Iterable.empty
    }

    // Now let's look at what is inside:
    attrs.foreach(println _)

    // Or just collect all values of "att5"
    attrs.collect { case (name, value) if name == "att5" => value }.foreach(println _)
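The reason for using "def" instead of "val" in the snippet above can be seen with plain iterators, without any XML library. This is only a minimal sketch; the names (once, fresh and so on) are made up for illustration and do not come from the answer.

```scala
// A val holds a single iterator instance: the first traversal consumes it.
val once = Iterator(1, 2, 3)
val firstPass = once.toList      // consumes all three elements
val exhausted = !once.hasNext    // nothing is left for a second pass

// A def builds a fresh iterator on every call, so it can be traversed again.
def fresh = Iterator(1, 2, 3)
val passA = fresh.toList
val passB = fresh.toList         // same elements again

// flatMap on an iterator yields another iterator, so the result stays lazy.
val flattened = Iterator("a=1", "b=2").flatMap(s => s.split("=").toList)
val flattenedList = flattened.toList
```

The price of the "def" approach is that every call parses the input again, which is fine for a small demo but worth remembering for large files.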

Scales XML (faster and needs less memory)

But the approach above will not be the fastest way. The Scala XML API is quite slow and memory hungry compared to other solutions, as benchmarks show. Fortunately, there is a faster and less memory-hungry solution:

    import scales.utils._
    import ScalesUtils._
    import scales.xml._
    import ScalesXml._
    import java.io.StringReader

    // A "def" again, because a Reader is stateful and can only be read once
    def xmlsrc = new StringReader("""<Response>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
        <Result att1="1" att2="2" att3="3" att4="4" att5="5"/>
    </Response>""")

    // Pull parser: an iterator over the XML events
    def pull = pullXml(xmlsrc)

    def attributes = pull flatMap {
      // Element starts carry the attributes
      case Left(elem: Elem) => elem.attributes
      // Everything else contributes nothing
      case _ => Nil
    } map (attr => (attr.name, attr.value))

    attributes.foreach(println _)

Don't forget to close your iterators when you are done with them. It is not necessary here, because I am working with a StringReader.
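When the input comes from a file or a socket instead of a StringReader, one way to make sure the underlying resource gets closed is a small loan-pattern helper. The sketch below uses only java.io; withReader is a hypothetical helper name, not part of Scales XML or the standard library, and the character count merely stands in for a real parse.

```scala
import java.io.{Reader, StringReader}

// Hypothetical helper (an assumption, not a library function):
// runs `body` with the reader and closes the reader afterwards, even on failure.
def withReader[R <: Reader, A](reader: R)(body: R => A): A =
  try body(reader) finally reader.close()

// A StringReader keeps the sketch self-contained; a FileReader works the same way.
val reader = new StringReader("<Response/>")

// Stand-in for a real parse: count the characters in the input.
// The lazy iterator is fully consumed (via length) before the block returns.
val charCount = withReader(reader) { r =>
  Iterator.continually(r.read()).takeWhile(_ != -1).length
}
```

Because pull parsers are lazy, the iterator has to be consumed inside the block (for example with foreach or toList). An iterator that escapes the block would try to read from a reader that is already closed.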

There is also the Anti-XML library, which looks quite good in benchmarks and seems to have a very nice API. Unfortunately, I could not get it to run with Scala 2.10, so I cannot provide a running example.

With the examples above, you should be able to write a small test application and use it to run your own benchmarks. Looking at the benchmarks mentioned above, I guess that Scales XML might solve your problem. But without real measurements, this is only a guess.

Benchmark it yourself, and perhaps post your results.
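A small test application along these lines could start from the sketch below. Everything in it is an assumption for illustration: the helper names (sampleXml, countAttributesStax, averageMillis) are made up, the input is synthetic, and the parser being timed is the JDK's built-in StAX cursor API (javax.xml.stream), which is not one of the libraries discussed above. It is used here only because it needs no extra dependency; the Scala XML and Scales XML snippets can be passed to the same timer.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Synthetic input shaped like the question's file: n flat Result elements.
def sampleXml(n: Int): String = {
  val row = """<Result att1="1" att2="2" att3="3" att4="4" att5="5"/>"""
  Iterator.fill(n)(row).mkString("<Response>", "", "</Response>")
}

// Baseline candidate (an assumption, not from the answer): JDK StAX cursor API.
// Returns the total number of attributes seen on all start tags.
def countAttributesStax(xml: String): Int = {
  val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
  try {
    var count = 0
    while (reader.hasNext()) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT)
        count += reader.getAttributeCount()
    }
    count
  } finally reader.close()
}

// Crude wall-clock timer: a few warm-up runs, then the average over `runs` runs.
def averageMillis(warmups: Int, runs: Int)(parse: => Any): Double = {
  (1 to warmups).foreach(_ => parse)
  val start = System.nanoTime()
  (1 to runs).foreach(_ => parse)
  (System.nanoTime() - start) / 1e6 / runs
}

val xml = sampleXml(10000)
val attributeCount = countAttributesStax(xml)
val millis = averageMillis(warmups = 5, runs = 20)(countAttributesStax(xml))
println(f"StAX baseline: $millis%.2f ms per parse")
```

Wall-clock timing like this is easily distorted by JIT compilation and garbage collection, so treat the numbers as a rough comparison only. A dedicated harness such as JMH is likely to give more trustworthy results.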
