Java: splitting up a large XML file with SAXParser


Question

I am trying to split a large XML file into smaller files using Java's SAXParser (specifically the Wikipedia dump, which is about 28GB uncompressed).

I have a PageHandler class which extends DefaultHandler:

private class PageHandler extends DefaultHandler {

    private StringBuffer text;
    ...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.append("<" + qName + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</" + qName + ">");

        if (qName.equals("page")) {
            text.append("\n");
            pageCount++;
            writePage();
        }

        if (pageCount >= maxPages) {
            rollFile();
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Append the reported slice in one call instead of char by char.
        text.append(chars, start, length);
    }
}

So I can write out element content no problem. My problem is how to get the element tags and attributes: these characters do not seem to be reported to characters(). At best I will have to reconstruct them from what's passed as arguments to startElement, which seems a bit of a pain. Or is there an easier way?

All I want to do is loop through the file and write it out, rolling the output file every so often. How hard can this be :)

Thanks

Recommended answer

I'm not quite sure I totally understand what you are trying to do, but the qualified name is already passed to you as a String, so you can use qName directly (calling qName.toString() just returns the same string). To get an attribute's name, call atts.getQName(int index) on the Attributes argument; the matching value is available from atts.getValue(int index), and atts.getLength() tells you how many attributes there are.
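To make that concrete, here is a minimal sketch of rebuilding the tags, including attributes, from the SAX callbacks. The class name TagRebuilder and the helper methods escape and rebuild are hypothetical names made up for this illustration, and the sketch assumes the default (non-namespace-aware) SAXParserFactory, so qName carries the element name exactly as written. It also re-escapes special characters, since the parser hands back decoded text; file writing and rolling are left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the original post): turns SAX events back into XML text.
class TagRebuilder extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.append('<').append(qName);
        // Attributes are only reported here, so they must be re-serialized by hand.
        for (int i = 0; i < atts.getLength(); i++) {
            text.append(' ').append(atts.getQName(i))
                .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        text.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // The parser delivers decoded text, so special characters are escaped again.
        text.append(escape(new String(chars, start, length)));
    }

    // Escapes the characters that are not allowed verbatim in XML text or attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses an XML string and returns the re-serialized markup.
    static String rebuild(String xml) throws Exception {
        TagRebuilder handler = new TagRebuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rebuild("<page id=\"1\"><title>A &amp; B</title></page>"));
    }
}
```

In the PageHandler from the question, the same loop over atts would go into startElement, with the existing page counting and rollFile() logic staying in endElement.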
