How can I get the first element's text using Nokogiri?


Problem description

I am trying to get the text for Last sold date from this HTML:

<td class="browse-cell-date">

    <span title="Last sold date">
        May 2002 
    </span>

    <button class="btn btn-previous-sales js-btn-previous-sales">
        Previous sales (1) <i class="icon icon-down-open-1"/>
    </button>

    <div class="previous-sales-panel is-hidden">
        <span style="display: block;">
            Aug 1997
            <span class="fright">£60,000</span>
        </span>
    </div>

</td>

I tried:

    date = val.search(".//td[@class='browse-cell-date']").children[1]

It gave me the span I wanted, but after I added .text to it, nothing was returned.

Solution

I'd start with:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
    <td class="browse-cell-date">

        <span title="Last sold date">
            May 2002 
        </span>

        <button class="btn btn-previous-sales js-btn-previous-sales">
            Previous sales (1) <i class="icon icon-down-open-1"/>
        </button>

        <div class="previous-sales-panel is-hidden">
            <span style="display: block;">
                Aug 1997
                <span class="fright">£60,000</span>
            </span>
        </div>

    </td>
EOT

sold_date = doc.at('span[title="Last sold date"]') # => #<Nokogiri::XML::Element:0x3ffc7e84c35c name="span" attributes=[#<Nokogiri::XML::Attr:0x3ffc7e84c2f8 name="title" value="Last sold date">] children=[#<Nokogiri::XML::Text:0x3ffc7e82bc10 "\n            May 2002 \n        ">]>
sold_date.text # => "\n            May 2002 \n        "
sold_date.text.strip # => "May 2002"

So

doc.at('span[title="Last sold date"]').text.strip # => "May 2002"

will do it.

at is like search('some selector').first, so use it for convenience. Both at and search are smart enough to figure out whether the selector is CSS or XPath most of the time, so I use those. If Nokogiri is fooled, I'll fall back to one of the explicit variants: at_css or at_xpath for a single node, css or xpath for a NodeSet.

Alternatively, you could use:

doc.at('td.browse-cell-date span').text.strip # => "May 2002"
doc.at('td.browse-cell-date > span').text.strip # => "May 2002"

Note: Using text with any of the search, xpath or css methods isn't a good idea. Those methods return a NodeSet, whose text method concatenates the text of every node in the set into a single string, which is rarely what you expect. Consider these examples:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
    <body>
        <p>foo</p>
        <p>bar</p>
    </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"

We regularly see questions where people have done this and then need to figure out how to split the concatenated text into something useful, which is usually very difficult.

99.99% of the time, you want to use map(&:text) to extract the text from each node in a NodeSet:

doc.search('p').map(&:text) # => ["foo", "bar"]

But in your case, simply use at, which returns a single Node, and then text will do what you expect.
