Extracting multiple-line content under header tags
Question
I posted a similar question that did not take multi-line bodies into account. I have HTML like the following, and I want to extract the "bodies" from it using Nokogiri:
html = %q|
<div class="content">
<h1>Title 1</h1>
Lorem ipsum 1
<h2>Title 2</h2>
Lorem ipsum 2
<h3>Title 3</h3>
<p>paragraph content 1</p>
<b>Lorem ipsum 3</b>
<p>paragraph content 2</p>
<h1>Title 4</h1>
Lorem ipsum 4
<h2>Title 5</h2>
Lorem ipsum 5
</div>
|
I want to extract the body content under each header and collect the results in an array like so:
[
"Lorem ipsum 1",
"Lorem ipsum 2",
"<p>paragraph content 1</p><b>Lorem ipsum 3</b><p>paragraph content 2</p>",
"Lorem ipsum 4",
"Lorem ipsum 5"
]
However, when I do this:
Nokogiri::HTML(html).
css("div").
children.
reject{|e| e.name =~ /\Ah\d\z/}.
map{|e| e.to_html.strip}.reject(&:empty?)
I get this array instead:
[
"Lorem ipsum 1",
"Lorem ipsum 2",
"<p>paragraph content 1</p>",
"<b>Lorem ipsum 3</b>",
"<p>paragraph content 2</p>",
"Lorem ipsum 4",
"Lorem ipsum 5"
]
Is there a way to keep the multi-line "body" content under one header together, so that I get my desired array?
Solution
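require "nokogiri"

Nokogiri::HTML(html)
.css("div").children
.slice_before{|e| e.name =~ /\Ah\d\z/}  # start a new group at each header tag (h1, h2, ...)
.map{|a| a.drop(1).map{|e| e.to_html.strip}.join}  # drop the header itself, join what follows
.reject(&:empty?)  # discard groups that held only whitespace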
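To see why this groups the bodies correctly, here is a minimal sketch of the same chain run on plain Ruby objects instead of a parsed document. The Node struct and the bodies method are hypothetical stand-ins I made up for illustration (they are not part of Nokogiri); they only mimic the name and to_html methods that the chain relies on.

```ruby
# Hypothetical stand-in for a parsed node; NOT a Nokogiri class.
Node = Struct.new(:name, :to_html)

# A shortened version of the div's children, including the whitespace
# text nodes that sit between the tags in the original HTML.
nodes = [
  Node.new("text", "\n"),
  Node.new("h1",   "<h1>Title 1</h1>"),
  Node.new("text", "\nLorem ipsum 1\n"),
  Node.new("h3",   "<h3>Title 3</h3>"),
  Node.new("text", "\n"),
  Node.new("p",    "<p>paragraph content 1</p>"),
  Node.new("text", "\n"),
  Node.new("b",    "<b>Lorem ipsum 3</b>"),
  Node.new("text", "\n"),
]

# Hypothetical helper wrapping the same chain as the answer.
def bodies(nodes)
  nodes
    .slice_before { |e| e.name =~ /\Ah\d\z/ }  # new group whenever a header starts
    .map { |group| group.drop(1).map { |e| e.to_html.strip }.join }  # skip first node, glue the rest
    .reject(&:empty?)                          # drop whitespace-only groups
end

result = bodies(nodes)
p result
# => ["Lorem ipsum 1", "<p>paragraph content 1</p><b>Lorem ipsum 3</b>"]
```

One thing to keep in mind: drop(1) removes the first node of every group, including the group before the first header. With this input that group is only whitespace, but if real content came before the first header, its first node would be lost.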