How do I parse this data structure returned by Nokogiri in Ruby?


Question

I am iterating over a collection, and this is what one element looks like when inspected:

[nil, [#<Nokogiri::XML::Element:0x835386d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x835385f8 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x835381c0 "Web Designer Full time">]>

What I would like to do is access the href value, and then the text value. How do I do that?

I tried:

puts i[:href]

But that produces this error:

TypeError: Symbol as array index

By the way, I am accessing i as an element of contents via each, like this:

contents.each do |i|
    puts i.inspect
    puts i[:href]
end

Edit 1:

This is how I am generating the contents array. I have not renamed it, because that could get confusing :)

contents = {}
first_items.each do |link|
    content_url = link
    content_page = Nokogiri::HTML(open(content_url))
    contents[link[:href]] = content_page.css("p a")
end

puts contents.inspect

This is the output:

{nil=>[#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]}

Here is the full output for i:

--------------------
This is the value of i: 
[nil, [#<Nokogiri::XML::Element:0x85fee914 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee838 name="href" value="http://bham.craigslist.org/web/2961573018.html">] children=[#<Nokogiri::XML::Text:0x85fee400 "Web Designer Full time">]>, #<Nokogiri::XML::Element:0x85fee298 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fee1bc name="href" value="http://bham.craigslist.org/web/2959813303.html">] children=[#<Nokogiri::XML::Text:0x85fedd84 "Once in a lifetime opportunity...">]>, #<Nokogiri::XML::Element:0x85fedc1c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fedb40 name="href" value="http://bham.craigslist.org/web/2925485723.html">] children=[#<Nokogiri::XML::Text:0x85fed708 "Website Designer and Blogging Internship!">]>, #<Nokogiri::XML::Element:0x85fed5a0 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fed4c4 name="href" value="http://bham.craigslist.org/web/2918424652.html">] children=[#<Nokogiri::XML::Text:0x85fed08c "Excellent Java Developer Opportunity!">]>, #<Nokogiri::XML::Element:0x85fecf24 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fece48 name="href" value="http://bham.craigslist.org/web/2888669703.html">] children=[#<Nokogiri::XML::Text:0x85feca10 "Freelance Graphic Design">]>, #<Nokogiri::XML::Element:0x85fec8a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec7cc name="href" value="http://bham.craigslist.org/web/2900256461.html">] children=[#<Nokogiri::XML::Text:0x85fec394 "GWT/GXT Developer">]>, #<Nokogiri::XML::Element:0x85fec22c name="a" attributes=[#<Nokogiri::XML::Attr:0x85fec150 name="href" value="http://bham.craigslist.org/web/2897641463.html">] children=[#<Nokogiri::XML::Text:0x85febd18 "Website hiring!">]>]]
--------------------
This is the value of i.href: 

Edit 2:

By the way, this is what the actual HTML output looks like. I did this:

builder = Nokogiri::HTML::Builder.new do |doc|
    doc.html {
        doc.body {
            contents.each do |el|
                if !el.nil?
                    puts "-" * 20
                    puts "This is the value of el: "
                    puts el.inspect

                    puts "-" * 20
                    puts "This is the value of el.href: "
                    puts el[:href]
                end

                doc.p {
                    doc.a el, :href => el
                }
            end
        }
    }
end

puts "*" * 50
puts "This is the HTML generated"

puts builder.to_html

This is what it looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><a href="&lt;a%20href=%22http://bham.craigslist.org/web/2961573018.html%22&gt;Web%20Designer%20Full%20time&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2959813303.html%22&gt;Once%20in%20a%20lifetime%20opportunity...&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2925485723.html%22&gt;Website%20Designer%20and%20Blogging%20Internship!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2918424652.html%22&gt;Excellent%20Java%20Developer%20Opportunity!&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2888669703.html%22&gt;Freelance%20Graphic%20Design&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2900256461.html%22&gt;GWT/GXT%20Developer&lt;/a&gt;&lt;a%20href=%22http://bham.craigslist.org/web/2897641463.html%22&gt;Website%20hiring!&lt;/a&gt;">&lt;a href="http://bham.craigslist.org/web/2961573018.html"&gt;Web Designer Full time&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2959813303.html"&gt;Once in a lifetime opportunity...&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2925485723.html"&gt;Website Designer and Blogging Internship!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2918424652.html"&gt;Excellent Java Developer Opportunity!&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2888669703.html"&gt;Freelance Graphic Design&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2900256461.html"&gt;GWT/GXT Developer&lt;/a&gt;&lt;a href="http://bham.craigslist.org/web/2897641463.html"&gt;Website hiring!&lt;/a&gt;</a></p></body></html>

Recommended answer

I think it can be a lot simpler. Nokogiri already parses the document and provides convenient ways to access the content. Rather than looping, storing Nokogiri objects, and then trying to extract values from them, why not try a more direct approach?

Try this code:

content_page.search('//a[@href]').map { |el| [el[:href], el.text] }

This creates a 2D array containing the href and text for each link in the document, which, according to your follow-up comment, is what you are actually working toward.
