Huge XML file to text files

Question

I have a huge XML file (15 GB). I want to convert each 'text' tag in the XML file into a single page, written out as its own text file.

Sample XML file:

<root>
    <page>
        <id> 1 </id>
        <text>
        .... 1000 to 50000 lines of text
        </text>
    </page>
    ... likewise, about 2 million `page` tags
</root>

I initially used a DOM parser, but it throws a Java OutOfMemoryError (understandably, given the file size). I have now written Java code using StAX. It works, but it is really slow.

This is the code I have written:

XMLEventReader xMLEventReader = XMLInputFactory.newInstance()
        .createXMLEventReader(new FileInputStream(filePath));
String pageContent = "";
boolean isText = false;

while (xMLEventReader.hasNext()) {
    XMLEvent xmlEvent = xMLEventReader.nextEvent();

    switch (xmlEvent.getEventType()) {
    case XMLStreamConstants.START_ELEMENT:
        String element = xmlEvent.asStartElement().getName().getLocalPart();
        // Compare strings with equals(); == only compares references.
        if ("text".equals(element))
            isText = true;
        break;
    case XMLStreamConstants.CHARACTERS:
        Characters chars = xmlEvent.asCharacters();
        if (!(chars.isWhiteSpace() || chars.isIgnorableWhiteSpace()))
            if (isText)
                pageContent += chars.getData() + '\n';
        break;
    case XMLStreamConstants.END_ELEMENT:
        String elementEnd = xmlEvent.asEndElement().getName().getLocalPart();
        if ("text".equals(elementEnd)) {
            // id and createFile are assumed to be defined elsewhere (not shown in the question).
            createFile(id, pageContent);
            pageContent = "";
            isText = false;
        }
        break;
    }
}

This code works (ignore any minor errors). According to my understanding, the XMLStreamConstants.CHARACTERS case is hit for each and every line of the text tag: if a text tag has 10,000 lines in it, the CHARACTERS case runs 10,000 times. Is there a better way to improve the performance?

Recommended answer

What is pageContent? It appears to be a String. One easy optimization to make right away would be to use a StringBuilder instead; it can append strings without making a completely new copy of the accumulated text on every append, which is what += on a String does (you can also construct it with an initial reserved capacity to reduce memory reallocations and copies if you have an idea of the length to begin with).

Concatenating Strings is a slow operation because strings are immutable in Java: each time you evaluate a += b, a new string must be allocated, a copied into it, and b copied onto the end, which makes each concatenation O(n) in the total length of the two strings. The same goes for appending single characters. Appending to a StringBuilder, on the other hand, has the same performance characteristics as appending to an ArrayList: the buffer grows occasionally, so the cost per appended character is amortized constant. So where you have:

pageContent += chars.getData() + '\n';

Instead, change pageContent to a StringBuilder and do:

pageContent.append(chars.getData()).append('\n');

Also, if you have a guess at the upper bound of the length of one of these strings, you can pass it to the StringBuilder constructor to allocate that much capacity up front and reduce the chance of a memory reallocation and full copy being needed.
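To make the StringBuilder suggestion concrete, here is a minimal sketch of the question's loop rewritten around a reusable, pre-sized StringBuilder. The class name TextExtractor, the method extractTexts, and the INITIAL_CAPACITY value are assumptions made for illustration, and the pages are collected into a list only to keep the example self-contained; the original code would call createFile(id, ...) at that point instead.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

class TextExtractor {
    // Assumed guess at the size of one page's text, in characters (hypothetical value).
    private static final int INITIAL_CAPACITY = 64 * 1024;

    /** Collects the content of every <text> element, one String per element. */
    static List<String> extractTexts(Reader xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(xml);
        List<String> pages = new ArrayList<>();
        StringBuilder pageContent = new StringBuilder(INITIAL_CAPACITY);
        boolean isText = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                if ("text".equals(event.asStartElement().getName().getLocalPart())) {
                    isText = true;
                }
            } else if (event.isCharacters()) {
                Characters chars = event.asCharacters();
                if (isText && !chars.isWhiteSpace()) {
                    // Appends in place; no copy of the text accumulated so far.
                    pageContent.append(chars.getData()).append('\n');
                }
            } else if (event.isEndElement()) {
                if ("text".equals(event.asEndElement().getName().getLocalPart())) {
                    pages.add(pageContent.toString()); // the question calls createFile(id, ...) here
                    pageContent.setLength(0);          // reuse the same buffer for the next page
                    isText = false;
                }
            }
        }
        reader.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><page><id>1</id><text>hello</text></page>"
                   + "<page><id>2</id><text>world</text></page></root>";
        System.out.println(extractTexts(new StringReader(xml)).size() + " page(s) extracted");
    }
}
```

Calling setLength(0) keeps the already-allocated buffer, so after the first few pages the builder rarely needs to grow again.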

Another option, by the way, is to skip the StringBuilder altogether and write your data directly to your output file (presuming you're not processing the data somehow first). If you do this, and performance is I/O-bound, putting the output files on a different physical disk from the input can help.
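The direct-write option could look roughly like the sketch below, which opens a buffered writer when a text element starts and closes it when the element ends, so a page is never held in memory as a whole. DirectPageWriter and writePages are hypothetical names, and the running counter used for file names is an assumption that stands in for the page id the question's code passes to createFile.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

class DirectPageWriter {
    /** Streams each <text> element into outputDir/<n>.txt; returns the number of files written. */
    static int writePages(Reader xml, Path outputDir) throws XMLStreamException, IOException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(xml);
        BufferedWriter out = null; // non-null only while inside a <text> element
        int pageCount = 0;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "text".equals(event.asStartElement().getName().getLocalPart())) {
                    pageCount++;
                    // Hypothetical naming scheme: a running counter stands in for the page id.
                    out = Files.newBufferedWriter(
                            outputDir.resolve(pageCount + ".txt"), StandardCharsets.UTF_8);
                } else if (event.isCharacters() && out != null) {
                    Characters chars = event.asCharacters();
                    if (!chars.isWhiteSpace()) {
                        out.write(chars.getData());
                        out.write('\n');
                    }
                } else if (event.isEndElement() && out != null
                        && "text".equals(event.asEndElement().getName().getLocalPart())) {
                    out.close();
                    out = null;
                }
            }
        } finally {
            if (out != null) out.close();
            reader.close();
        }
        return pageCount;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><page><id>1</id><text>hello</text></page></root>";
        Path dir = Files.createTempDirectory("pages");
        System.out.println(writePages(new StringReader(xml), dir) + " file(s) written to " + dir);
    }
}
```

With roughly two million output files, the per-file open and close cost may end up dominating, so it is worth measuring this variant against the StringBuilder one on real data before settling on either.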
