How to collect the first of several elements of a node in Nokogiri


Problem description


I have data that looks like:

<release>
 <artists>
  <artist>
   <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
 </artists>
</release>
<release>
 <artists>
  <artist>
   <name>Speed</name>
  </artist>
  <artist>
    <name>The Matrix</name>
  </artist>
 </artists>
</release>
...and so on.

For each release I want only the data from the first <artist> tag. I tried the following code but it pulls all text from the artists:

page = Nokogiri::XML(open("37.xml"))

page.xpath("//artists[1]").each do |el|
  File.open("#{LOCAL_DIR}/37.txt", 'a') { |f| f.write(el) }
end

Solution

Nokogiri supports two main types of searches, search and at. search returns a NodeSet, which you should think of like an array. at returns a single Node, or nil if nothing matches. Either can take a CSS or XPath expression. I prefer CSS selectors since they're more readable, but sometimes you can't easily get where you want to be with one, so try the other.

For your question, it's important to be specific about which node you call text on. If your search is too broad, you'll get the whitespace and text from between tags in addition to the text inside the tag you want. To avoid that, drill down to the node closest to what you're trying to read:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<release>
 <artists>
  <artist>
   <name>Johnny Mnemonic</name>
  </artist>
  <artist>
    <name>Constantine</name>
  </artist>
 </artists>
</release>
EOT

Because these look for the name node specifically, the text desired is easy to get without garbage:

doc.at('name').text                # => "Johnny Mnemonic"
doc.at('artist name').text         # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"

These are looser searches so more junk is returned:

doc.at('artist').text  # => "\n   Johnny Mnemonic\n  "
doc.at('artists').text # => "\n  \n   Johnny Mnemonic\n  \n  \n    Constantine\n  \n "

Using search returns multiple nodes:

doc.search('name').map(&:text)

[
    [0] "Johnny Mnemonic",
    [1] "Constantine"
]

doc.search('artist').map(&:text)

[
    [0] "\n   Johnny Mnemonic\n  ",
    [1] "\n    Constantine\n  "
]

The only real difference between search and at is that at is like search(...).first.

See "How to avoid joining all text from Nodes when scraping" also.

Nokogiri also has more specific variants for convenience: at_css and css accept only CSS selectors, and at_xpath and xpath accept only XPath expressions, whereas at and search accept either.


Here are alternate ways to get at the names using CSS and XPath accessors, clipped from Pry. Judging by the results, doc here holds both releases from the question:

[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
    [0] "Johnny Mnemonic",
    [1] "Speed"
]
