How to avoid joining all text from Nodes when scraping
When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For instance:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
["foo", "bar", "baz"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen and how do I avoid it?
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same applies whether we're working with HTML or XML, because Nokogiri returns the same NodeSet and Node classes for both.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.