Clean Up HTML in Python
Question
I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up the errors in Python natively or any third party modules I could install?
I would suggest BeautifulSoup. Its parser deals with malformed tags quite gracefully. Once you've read the markup into a tree, you can simply write the tree back out as well-formed HTML.
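from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)
# bad_html is a string holding the malformed markup
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()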
I've used this many times and it works wonders. If you only need to pull data out of bad HTML rather than re-emit it, that is where BeautifulSoup really shines.
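To make the extraction point concrete, here is a minimal sketch. It assumes Beautiful Soup 4 (the `bs4` package) with Python's built-in `html.parser`. The sample markup and the variable names are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the <b> and <a> tags are never closed.
bad_html = '<div><p>Hello <b>world</p><a href="http://example.com/a">link</div>'

soup = BeautifulSoup(bad_html, "html.parser")

# Serializing the tree writes out the closing tags the input lacked.
cleaned = str(soup)

# Data can be pulled straight from the repaired tree.
paragraph_text = soup.p.get_text()
links = [a["href"] for a in soup.find_all("a", href=True)]

print(cleaned)
print(paragraph_text)
print(links)
```

Different parsers (for example `lxml` or `html5lib`, if installed) can repair the same broken markup in different ways, so it is worth checking the output against a few samples of your real input.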