Hpricot, Get all text from document
I have just started learning Ruby. Very cool language, liking it a lot.
I am using the very handy Hpricot HTML parser.
What I am looking to do is grab all the text from the page, excluding the HTML tags.
Example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Data Protection Checks</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
This is what I want to grab.
</div>
<p>
I also want to grab this text
</p>
</body>
</html>
I am basically wanting to grab only the text so I end up with a string like so:
"This is what I want to grab. I also want to grab this text"
What would be the best method of doing this?
Cheers
Eef
You can do this using the XPath text() selector.
require 'hpricot'
require 'open-uri'
doc = open("http://stackoverflow.com/") { |f| Hpricot(f) }
text = (doc/"//*/text()") # array of text values
puts text.join("\n")
However, this is a fairly expensive operation. A better solution might be available.
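To get from the array of text values to the single string asked for in the question, the pieces still need their whitespace cleaned up. Here is a minimal sketch of that step, assuming each element of the array converts to a string with to_s. The helper name join_text_nodes and the sample array are made up for illustration and are not part of Hpricot.

```ruby
# Hypothetical helper (not part of Hpricot): collapse raw text-node
# strings into one space-separated line.
def join_text_nodes(nodes)
  nodes.map { |n| n.to_s.gsub(/\s+/, " ").strip } # squeeze whitespace runs, trim the ends
       .reject(&:empty?)                           # drop whitespace-only nodes
       .join(" ")
end

# Sample strings standing in for the text nodes of the <div> and <p>
# in the question; a real run would pass in the array from the XPath query.
sample = [
  "\n",
  "\nThis is what I want to grab.\n",
  "\n",
  "\nI also want to grab this text\n",
  "\n"
]

puts join_text_nodes(sample)
# prints: This is what I want to grab. I also want to grab this text
```

Note that a query matching every text node would also pick up the contents of the title element, so that text would need to be filtered out separately if it is not wanted.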
这篇关于Hpricot,从文档中获取所有文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!