Extracting multiple-line content under header tags

Problem description

I posted a similar question that did not account for bodies spanning multiple lines. I have HTML like the following, and I want to extract the "bodies" from it (using Nokogiri):

html = %q|
    <div class="content">
      <h1>Title 1</h1>
        Lorem ipsum 1

      <h2>Title 2</h2>
        Lorem ipsum 2

      <h3>Title 3</h3>
        <p>paragraph content 1</p>
        <b>Lorem ipsum 3</b>
        <p>paragraph content 2</p>

      <h1>Title 4</h1>
        Lorem ipsum 4

      <h2>Title 5</h2>
        Lorem ipsum 5
   </div>
   |

I want to extract the body content under each header and collect the pieces into an array like so:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<p>paragraph content 1</p><b>Lorem ipsum 3</b><p>paragraph content 2</p>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

However, when I do this:

Nokogiri::HTML(html).
  css("div").
  children.
  reject{|e| e.name =~ /\Ah\d\z/}.
  map{|e| e.to_html.strip}.reject(&:empty?)

I get this array instead:

[
  "Lorem ipsum 1",
  "Lorem ipsum 2",
  "<p>paragraph content 1</p>",
  "<b>Lorem ipsum 3</b>",
  "<p>paragraph content 2</p>",
  "Lorem ipsum 4",
  "Lorem ipsum 5"
]

Is there a way to extract the multi-line "body" content so that I get my desired array?

Solution

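require "nokogiri"

# Start a new slice at every header, drop the header itself, and join what follows it.
Nokogiri::HTML(html)
  .css("div").children
  .slice_before { |e| e.name =~ /\Ah\d\z/ }
  .map { |a| a.drop(1).map { |e| e.to_html.strip }.join }
  .reject(&:empty?)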
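
The answer is given without explanation, so here is a minimal sketch of what `slice_before` appears to be doing, run on plain data instead of a parsed document. The `nodes` array and its `[name, html]` pairs are hypothetical stand-ins for the div's child nodes (real Nokogiri nodes respond to `name` and `to_html`); only `slice_before`, `drop`, `map`, `join` and `reject` come from the answer itself.

```ruby
# Hypothetical stand-ins for the div's children: [node name, serialized HTML].
# Whitespace-only "text" entries mimic the text nodes between elements.
nodes = [
  ["text", "\n  "],
  ["h1",   "<h1>Title 1</h1>"],
  ["text", "\n  Lorem ipsum 1\n  "],
  ["h3",   "<h3>Title 3</h3>"],
  ["p",    "<p>paragraph content 1</p>"],
  ["text", "\n  "],
  ["b",    "<b>Lorem ipsum 3</b>"]
]

HEADER = /\Ah\d\z/

# slice_before opens a new chunk at every element for which the block is truthy,
# so every chunk except possibly the first begins with a header.
chunks = nodes.slice_before { |node| node[0] =~ HEADER }.to_a

# drop(1) discards the leading element of each chunk (the header),
# then the remaining pieces are stripped and joined into one body string.
bodies = chunks
  .map { |chunk| chunk.drop(1).map { |node| node[1].strip }.join }
  .reject(&:empty?)

p bodies
```

Because each body is built by joining everything between two headers, the `<p>` and `<b>` siblings end up in a single string instead of separate array entries. One thing to be aware of: `drop(1)` also discards the first element of the leading chunk, which here is only whitespace; if meaningful content could precede the first header, that case would need separate handling.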
