Get the first few elements of an HTML fragment with XPath in Ruby


Question

For a blog-like project, I want to take the first few paragraphs, headers, lists or whatever else falls within a given number of characters from a Markdown-generated HTML fragment, and display them as a summary.

So if I have:

<h1>hello world</h1>
<p>Lets say these are 100 chars</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>

Assume I want the summary to consist of the text within the first 150 characters. It does not have to be overly exact: I could just take the first 150 characters including tags and go on with that, but that would probably create artifacts at the tail that could be harder to handle. The summary should give me the h1, the p and the ul, but not the final p (which would be truncated). If the first element alone has more than 150 characters, I would take the full first element.

How could I get this? Using XPath or a regex? I am a bit out of ideas on that...

First, I want to give a big THANK YOU to all of you who replied!

While I got really great answers in this thread, I actually found it much easier to hook in before the Markdown interpreter runs: take the first n text blocks separated by \r\n\r\n and pass only those on for Markdown generation.

class String
  # Returns the leading text blocks (separated by \r\n\r\n) that fit within `length` characters.
  def summarize_md(length)
    arr = split(/\r\n\r\n/)
    sum = ""
    arr.each do |ea|
      break if sum.length + ea.length > length
      sum += "#{ea}\r\n\r\n"
    end
    sum
  end
end

While one could probably reduce this code to a one-liner, it is still much simpler and more CPU-friendly than any of the proposed solutions. Anyway, since my question could be read as if the HTML were the starting point (and not the Markdown text), I'll just give the answer to the first guy... I hope that's just...
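As written, summarize_md returns an empty string when the first block alone is longer than the limit, whereas the question asks for the first element to be kept in full in that case. The following is a minimal sketch of a variant that handles this. It is not the asker's code: the name `summarize_markdown` is hypothetical, and it assumes blocks are separated by one or more blank lines with either \n or \r\n line endings.

```ruby
# Hypothetical helper (not from the original post): keeps leading
# Markdown blocks until the next block would exceed `limit` characters.
def summarize_markdown(text, limit)
  # Treat blank lines as block separators, for both \n and \r\n endings.
  blocks = text.split(/(?:\r?\n){2,}/)
  kept = []
  used = 0
  blocks.each do |block|
    # Always keep the first block, even if it alone exceeds the limit.
    break if !kept.empty? && used + block.length > limit
    kept << block
    used += block.length
  end
  # Rejoin the kept blocks with a blank line so Markdown still sees separate blocks.
  kept.join("\n\n")
end

markdown = "# hello world\r\n\r\nLets say these are 100 chars\r\n\r\nsome other text"
puts summarize_markdown(markdown, 45)
```

Only the block text counts toward the limit here, not the separators, so the cut-off differs slightly from the original method.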

Recommended answer

Using XPath is the most robust and flexible approach. Here's a sample app:

require 'rubygems'
require 'nokogiri'

html = <<End
<h1>hello world</h1>
<p>Lets say these are 100 chars.......................................................................</p>
<ul>
    <li>some bla bla, 40 chars</li>
</ul>
<p>some other text</p>
End

LIMIT = 150
summary = ""

doc = Nokogiri::HTML.parse(html)
# Walk the text nodes in document order; stop once adding the next one would reach LIMIT.
doc.xpath('//text()').each do |node|
  text = node.text
  break if summary.length + text.length >= LIMIT
  summary << text
end

puts summary
puts summary.length

The XPath expression //text() simply selects all the text nodes in the document. If you want to be more specific about which elements you are interested in, you can narrow the expression accordingly.
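One way to be more specific is to select whole top-level elements instead of individual text nodes, so the summary keeps its markup and never cuts an element in half. The sketch below illustrates that idea under some loudly stated assumptions: it uses REXML from Ruby's standard distribution rather than Nokogiri (to stay dependency-free), it assumes the fragment is well-formed XML, and the names `summarize_html`, `text_of` and the wrapper element `<root>` are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: concatenates the text of a node and all its descendants.
def text_of(node)
  return node.value if node.is_a?(REXML::Text)
  return '' unless node.is_a?(REXML::Element)
  node.children.map { |child| text_of(child) }.join
end

# Hypothetical helper: returns the leading top-level elements whose
# combined text length stays within `limit` characters.
def summarize_html(html, limit)
  # Wrap the fragment in an invented <root> element so the parser sees a single root.
  doc = REXML::Document.new("<root>#{html}</root>")
  kept = []
  used = 0
  # /root/* selects only the top-level elements, in document order.
  REXML::XPath.each(doc, '/root/*') do |element|
    size = text_of(element).strip.length
    # Always keep the first element; afterwards stop before exceeding the limit.
    break if !kept.empty? && used + size > limit
    kept << element.to_s
    used += size
  end
  kept.join("\n")
end

html = <<~HTML
  <h1>hello world</h1>
  <p>Lets say these are 100 chars</p>
  <ul>
      <li>some bla bla, 40 chars</li>
  </ul>
  <p>some other text</p>
HTML

puts summarize_html(html, 50)
```

Because only text length is counted, tags do not eat into the budget, and an oversized first element is still returned in full, as the question asks.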
