Clean Up HTML in Python

Views: 436 Published: 2017/5/29 3:04:44 Tags: python html django

Question

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. Good examples are missing closing tags and malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?

Solution

I would suggest BeautifulSoup. It has a wonderful parser that deals with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

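# BeautifulSoup 4 is imported from bs4; the old BeautifulSoup 3 import was `from BeautifulSoup import BeautifulSoup`.
from bs4 import BeautifulSoup
tree = BeautifulSoup(bad_html, "html.parser")  # lenient parse; unclosed tags are closed in the tree
good_html = tree.prettify()  # serialize the repaired tree as indented HTML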

I've used this many times and it works wonders. If you only need to extract data from bad HTML rather than re-emit it, that is where BeautifulSoup really shines.
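
To make both uses concrete, here is a minimal sketch, assuming BeautifulSoup 4 is installed (`pip install beautifulsoup4`) and the standard library's "html.parser" backend is used. The helper names `repair_html` and `extract_links` and the sample markup are hypothetical, made up for illustration; they are not part of the library or of the original answer.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def repair_html(bad_html: str) -> str:
    """Parse leniently and re-serialize, so tags left open get closed."""
    soup = BeautifulSoup(bad_html, "html.parser")
    return str(soup)


def extract_links(bad_html: str) -> list[tuple[str, str]]:
    """Return (href, text) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(bad_html, "html.parser")
    return [
        (a["href"], a.get_text(strip=True))
        for a in soup.find_all("a", href=True)
    ]


# Hypothetical input: the second <a> and the <p> are never closed.
sample = '<div><a href="/a">First</a><a href="/b">Second</div><p>Trailing text'

print(repair_html(sample))    # the markup with the missing closing tags added
print(extract_links(sample))  # the link data, pulled out without repairing by hand
```

Different parser backends can repair the same broken markup differently, so if the cleaned output has to match what a browser would build, it may be worth trying the "lxml" or "html5lib" backends (each a separate install) in place of "html.parser".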
