Clean Up HTML in Python


Question

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up the errors in Python natively or any third party modules I could install?

Solution

I would suggest Beautiful Soup. Its parser handles malformed markup, such as unclosed tags, quite gracefully. Once you have parsed the document into a tree, you can simply serialize it back out as cleaned-up HTML.

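# Beautiful Soup 4 (pip install beautifulsoup4); the BeautifulSoup 3 import shown originally does not work on Python 3
from bs4 import BeautifulSoup
bad_html = "<p>Unclosed paragraph <b>bold text"  # sample malformed input
tree = BeautifulSoup(bad_html, "html.parser")  # parse leniently with the parser backed by the standard library
good_html = tree.prettify()  # serialize the tree back out, with the missing tags closed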

I've used this many times and it works wonders. Beautiful Soup really shines when all you need is to pull data out of bad HTML rather than repair it.
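
To illustrate the data-extraction case mentioned above, here is a minimal sketch that assumes Beautiful Soup 4 is installed (`pip install beautifulsoup4`). The `extract_links` helper and the sample markup are hypothetical, made up for this example; they are not part of the original answer.

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4


def extract_links(html):
    """Return the href of every <a> tag, even if the markup is malformed."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical malformed input: an unquoted attribute value and
# no closing </a>, </li>, or </ul> tags.
bad_html = '<ul><li><a href=first.html>First<li><a href="second.html">Second'
links = extract_links(bad_html)
print(links)
```

Because the tree is built leniently, both links are found even though none of the tags are closed. Other parsers that Beautiful Soup can use (for example `lxml` or `html5lib`, if installed) may repair the same input into a slightly different tree, so it is worth checking the result against your own data.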
