How do I pretty-print HTML with Nokogiri?


Question

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Solution

By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
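To show what that parameter is, here is a minimal sketch of how the "pp" library drives a pretty_print method. FakeNode is a hypothetical stand-in written for this illustration, not a Nokogiri class; the point is only that pp passes its own printer object (called q here) into pretty_print, which is why the method can't usefully be called by hand with no argument.

```ruby
require "pp"
require "stringio"

# Hypothetical stand-in for a parsed node (assumption: not part of Nokogiri).
class FakeNode
  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # pp calls this hook and hands in its printer object (q).
  # That printer is the parameter the method expects.
  def pretty_print(q)
    q.group(2, "#<FakeNode #{@name}", ">") do
      @children.each do |child|
        q.breakable   # a space, or a newline if the line gets too long
        q.pp(child)   # recurse; pp calls child.pretty_print(q)
      end
    end
  end
end

tree = FakeNode.new("html", [FakeNode.new("body", [FakeNode.new("p")])])

# Write the debug dump into a string buffer, using a narrow width of 20
# columns so that the nested groups are forced to wrap.
out = StringIO.new
PP.pp(tree, out, 20)
dump = out.string
puts dump
```

The result is an inspect-style dump of the object tree, not reformatted markup, so it doesn't help with writing nicely indented HTML to disk.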

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). By googling, though, I found a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

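require "nokogiri"

# Load the stylesheet, parse the cached page, then apply the transform.
xsl  = Nokogiri::XSLT(File.read("pretty_print.xsl"))
html = File.open("source.html") { |f| Nokogiri::HTML(f) }
puts xsl.apply_to(html)  # apply_to already returns a String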

It requires you, of course, to download the linked xsl file to your filesystem. I've tried it very quickly on my machine and it works like a charm.
