How to use SAX with Nokogiri?

Question

I want to parse a very large XML file (240 MB) and have to use SAX to avoid loading the whole file into memory.

My XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
<hotels>
  <hotel>
    <hotelId>1568054</hotelId>
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
    <hotelName>"Der Obere Wirt" zum Queri</hotelName>
    <rating>3</rating>
    <cityId>34633</cityId>
    <cityFileName>Andechs</cityFileName>
    <cityName>Andechs</cityName>
    <stateId>212</stateId>
    <stateFileName>Bavaria</stateFileName>
    <stateName>Bavaria</stateName>
    <countryCode>DE</countryCode>
    <countryFileName>Germany</countryFileName>
    <countryName>Germany</countryName>
    <imageId>51498149</imageId>
    <Address>Georg Queri Ring 9</Address>
    <minRate>85.9800</minRate>
    <currencyCode>EUR</currencyCode>
    <Latitude>48.009423000000</Latitude>
    <Longitude>11.214504000000</Longitude>
    <NumberOfReviews>16</NumberOfReviews>
    <ConsumerRating>4.25</ConsumerRating>
    <PropertyType>0</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1658359</hotelId>
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
    <hotelName>"Seclusions" of Yallingup</hotelName>
    <rating>4</rating>
    <cityId>72257</cityId>
    <cityFileName>Yallingup</cityFileName>
    <cityName>Yallingup</cityName>
    <stateId>172</stateId>
    <stateFileName>Western_Australia</stateFileName>
    <stateName>Western Australia</stateName>
    <countryCode>AU</countryCode>
    <countryFileName>Australia</countryFileName>
    <countryName>Australia</countryName>
    <imageId>53234107</imageId>
    <Address>58 Zamia Grove</Address>
    <minRate>218.1825</minRate>
    <currencyCode>AUD</currencyCode>
    <Latitude>-33.691192000000</Latitude>
    <Longitude>115.061938999999</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>3</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1491947</hotelId>
    <hotelFileName>1_Melrose_Blvd</hotelFileName>
    <hotelName>#1 Melrose Blvd</hotelName>
    <rating>5</rating>
    <cityId>964</cityId>
    <cityFileName>Johannesburg</cityFileName>
    <cityName>Johannesburg</cityName>
    <stateId/>
    <stateFileName/>
    <stateName/>
    <countryCode>ZA</countryCode>
    <countryFileName>South_Africa</countryFileName>
    <countryName>South Africa</countryName>
    <imageId>46777171</imageId>
    <Address>1 Melrose Boulevard Melrose Arch</Address>
    <minRate/>
    <currencyCode>ZAR</currencyCode>
    <Latitude>-26.135656000000</Latitude>
    <Longitude>28.067751000000</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>9</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1726938</hotelId>
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
    <hotelName>#1 Value Inn Clovis</hotelName>
    <rating>2</rating>
    <cityId>28538</cityId>
    <cityFileName>Clovis_New_Mexico</cityFileName>
    <cityName>Clovis (New Mexico)</cityName>
    <stateId>32</stateId>
    <stateFileName>New_Mexico</stateFileName>
    <stateName>New Mexico</stateName>
    <countryCode>US</countryCode>
    <countryFileName>United_States</countryFileName>
    <countryName>United States</countryName>
    <imageId/>
    <Address>1720 Mabry</Address>
    <minRate/>
    <currencyCode>USD</currencyCode>
    <Latitude>34.396549224853</Latitude>
    <Longitude>-103.182769775390</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>2</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
  </hotel>
</hotels>

I tried this code:

require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here, called as part of Class.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
    if name == 'hotel'
      # current Nokogiri passes attributes as [name, value] pairs
      a = attributes.to_h
      puts attributes
      # more business...
    end
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end

end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')

I can access the tag names, but how can I access their content?

Recommended answer

The characters callback (Wikihandler#characters in your handler) receives the text content of each element. You could do something like:

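require 'nokogiri'

class MyDocument < Nokogiri::XML::SAX::Document
  attr_accessor :is_name

  def initialize
    @is_name = false
  end

  def end_document
    puts "the document has ended"
  end

  # Remember whether the parser is currently inside a <hotelName> element.
  def start_element(name, attributes = [])
    @is_name = (name == "hotelName")
  end

  # Clear the flag once the <hotelName> element is closed.
  def end_element(name)
    @is_name = false if name == "hotelName"
  end

  # Receives the text between tags; whitespace-only text is skipped.
  def characters(string)
    text = string.strip # non-destructive, leaves the parser's string untouched
    puts "Name: #{text}" if @is_name && !text.empty?
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')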

However, if you want to make your life easier, I'd suggest checking out sax-machine. It adds some nice functionality and (IMHO) a friendlier interface to Nokogiri's SAX parser. Here is some sample code and specs:

require "sax-machine"
require "rspec"

XML = <<XML
<?xml version="1.0" encoding="utf-8"?>
<hotels>
  <hotel>
    <hotelId>1568054</hotelId>
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
    <hotelName>"Der Obere Wirt" zum Queri</hotelName>
    <rating>3</rating>
    <cityId>34633</cityId>
    <cityFileName>Andechs</cityFileName>
    <cityName>Andechs</cityName>
    <stateId>212</stateId>
    <stateFileName>Bavaria</stateFileName>
    <stateName>Bavaria</stateName>
    <countryCode>DE</countryCode>
    <countryFileName>Germany</countryFileName>
    <countryName>Germany</countryName>
    <imageId>51498149</imageId>
    <Address>Georg Queri Ring 9</Address>
    <minRate>85.9800</minRate>
    <currencyCode>EUR</currencyCode>
    <Latitude>48.009423000000</Latitude>
    <Longitude>11.214504000000</Longitude>
    <NumberOfReviews>16</NumberOfReviews>
    <ConsumerRating>4.25</ConsumerRating>
    <PropertyType>0</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1658359</hotelId>
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
    <hotelName>"Seclusions" of Yallingup</hotelName>
    <rating>4</rating>
    <cityId>72257</cityId>
    <cityFileName>Yallingup</cityFileName>
    <cityName>Yallingup</cityName>
    <stateId>172</stateId>
    <stateFileName>Western_Australia</stateFileName>
    <stateName>Western Australia</stateName>
    <countryCode>AU</countryCode>
    <countryFileName>Australia</countryFileName>
    <countryName>Australia</countryName>
    <imageId>53234107</imageId>
    <Address>58 Zamia Grove</Address>
    <minRate>218.1825</minRate>
    <currencyCode>AUD</currencyCode>
    <Latitude>-33.691192000000</Latitude>
    <Longitude>115.061938999999</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>3</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1491947</hotelId>
    <hotelFileName>1_Melrose_Blvd</hotelFileName>
    <hotelName>#1 Melrose Blvd</hotelName>
    <rating>5</rating>
    <cityId>964</cityId>
    <cityFileName>Johannesburg</cityFileName>
    <cityName>Johannesburg</cityName>
    <stateId/>
    <stateFileName/>
    <stateName/>
    <countryCode>ZA</countryCode>
    <countryFileName>South_Africa</countryFileName>
    <countryName>South Africa</countryName>
    <imageId>46777171</imageId>
    <Address>1 Melrose Boulevard Melrose Arch</Address>
    <minRate/>
    <currencyCode>ZAR</currencyCode>
    <Latitude>-26.135656000000</Latitude>
    <Longitude>28.067751000000</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>9</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1726938</hotelId>
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
    <hotelName>#1 Value Inn Clovis</hotelName>
    <rating>2</rating>
    <cityId>28538</cityId>
    <cityFileName>Clovis_New_Mexico</cityFileName>
    <cityName>Clovis (New Mexico)</cityName>
    <stateId>32</stateId>
    <stateFileName>New_Mexico</stateFileName>
    <stateName>New Mexico</stateName>
    <countryCode>US</countryCode>
    <countryFileName>United_States</countryFileName>
    <countryName>United States</countryName>
    <imageId/>
    <Address>1720 Mabry</Address>
    <minRate/>
    <currencyCode>USD</currencyCode>
    <Latitude>34.396549224853</Latitude>
    <Longitude>-103.182769775390</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>2</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
  </hotel>
</hotels>
XML

class Hotel
  include SAXMachine
  element :hotelId, :as => :id
  element :hotelName, :as => :name
end

class Wikihandler
  include SAXMachine
  elements :hotel, :as => :hotels, :class => Hotel
end

describe Wikihandler do
  before(:all) do
    @parser = Wikihandler.new
    @parser.parse XML
  end

  it "should parse the proper number of hotels" do
    expect(@parser.hotels.count).to eq 4
  end

  it "should parse the hotel id of each entry" do
    expect(@parser.hotels[0].id).to eq "1568054"
  end

  it "should parse the hotel name of each entry" do
    expect(@parser.hotels[0].name).to eq '"Der Obere Wirt" zum Queri'
  end
end
