How to avoid joining all text from Nodes when scraping
Problem Description
When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For example:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
["foo", "bar", "baz"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen, and how do I avoid it?
Recommended Answer
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:

Get the inner text of all contained Node objects
which is what we're seeing with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
That can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same thing applies whether we're working with HTML or XML, as HTML is a more relaxed version of XML.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the content for this Node.