How do I scrape HTML between two HTML comments using Nokogiri?

Views: 74 Published: 2021/6/8 18:46:55 Tags: ruby-on-rails ruby web-scraping web-crawler nokogiri


Question

I have some HTML pages where the contents to be extracted are marked with HTML comments like below.

<html>
 .....
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
...
</html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.



I want to extract the full elements between these two HTML comments:
<div>some text</div>
<div><p>Some more elements</p></div>
I can get the text-only version using this characters callback:
require 'nokogiri'

class TextExtractor < Nokogiri::XML::SAX::Document

  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end

  # Called for every comment; toggles capture on the two marker comments.
  def comment(string)
    case string.strip          # strip leading and trailing whitespace
    when /^begin content/      # match starting comment
      @interesting = true
    when /^end content/        # match closing comment
      @interesting = false
    end
  end

  # Called for text nodes; keeps only the text between the markers.
  def characters(string)
    @text << string if @interesting
  end

end
I get the text-only version with @text but I need the full HTML stored in @html.
Answer
Extracting content between two nodes is not something we normally do; usually we want the content inside a particular node. Comments are nodes too, just a special type of node.
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT
By looking for a comment containing the specified text, it's possible to find a starting node:
start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">
Once the start comment is found, a loop stores the current node and then moves to the next sibling, until it reaches another comment:
content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
  break if contained_node.comment?
  content << contained_node
  contained_node = contained_node.next_sibling
end

content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"


                        