Using SAX Parser to get several sub-nodes?


Question

I have a large local XML file (24 GB) with a structure like this:

<id>****</id>
<url> ****</url> (several times within an id...)

I need a result like this:

id1;url1
id1;url2
id1;url3
id2;url4
....

I want to use Nokogiri with either the SAX Parser or the Reader, since I can't load the whole file into memory. I am using a Ruby Rake task to execute the code.

My code with SAX is:

task :fetch_saxxml => :environment do

  require 'nokogiri'
  require 'open-uri'

  class MyDocument < Nokogiri::XML::SAX::Document
    attr_accessor :is_name

    def initialize
      @is_name = false
    end

    def start_element name, attributes = []
      @is_name = name.eql?("id")
    end

    def characters string
      string.strip!
      if @is_name and !string.empty?
        puts "ID: #{string}"
      end
    end

    def end_document
      puts "the document has ended"
    end

  end

  parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
  parser.parse_file('/path_to_my_file.xml')

end

That works for fetching the IDs in the file, but I also need to fetch the URLs within each id node.

How do I put something like "each do" in that code to fetch the URLs and produce the output shown above? Or is it possible to perform several actions within "characters"?

Recommended answer

Here is a solution for parsing several nodes as they occur. One caveat with SAX parsers is that you have to find a way to handle special characters such as "&" (the characters callback can be called more than once for a single text node), but that is another story.

Here is my code:

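require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  # Remember the element being read and whether we are inside the parent node.
  def start_element name, attrs = []
    @inside_content = true if name == 'yourvalue'
    @current_element = name
  end

  # Called for text content; may be called several times for one text node.
  def characters str
    if @current_element == 'your_1st subnode'
      # handle the text of the first sub-node here
    elsif @current_element == 'your 2nd subnode'
      # handle the text of the second sub-node here
    end
    puts "#{@current_element} - #{str}" if @inside_content && %w{your_subnodes here}.include?(@current_element)
  end

  # Leaving the parent node switches the output off again.
  def end_element name
    @inside_content = false if name == 'yourvalue'
    @current_element = nil
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse_file('/path_to_your.xml')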
