How to parse non-strict HTML documents leniently?


Question


I have one more question today.
Are there any HTML parsers available that do not enforce strict syntax?
As far as I can see, such lenient parsers are built into web browsers.
What I mean is that it would be very nice to have a parser that processes the input document leniently, allowing any of the following situations, which are invalid in XHTML and XML:

  • single tags that are not self-closed, for example <br> or <hr>
  • tag pairs with mismatched casing: <td>...</TD>
  • attribute values without quotation marks: <span class=hilite>...</SPAN>
  • and so on

Please suggest any suitable parser.
Thank you.

Solution

If you're happy with Python, Beautiful Soup is just such a parser.

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."
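
As a rough illustration of the cases listed in the question, here is a minimal sketch. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses Python's built-in `html.parser` backend. The sample markup, the `extract` helper, and the dictionary keys are made up for this example and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup covering the cases from the question:
# unclosed <br> and <hr>, mismatched tag casing, and an unquoted attribute.
# (Hypothetical sample input, not taken from a real page.)
SLOPPY_HTML = """
<table><tr><td>first cell<br>second line</TD></tr></table>
<hr>
<span class=hilite>highlighted text</SPAN>
"""


def extract(html):
    """Parse lenient HTML and pull out a few values (illustrative helper)."""
    # "html.parser" selects the parser backend from the standard library.
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td")
    span = soup.find("span")
    return {
        # Text of the cell whose closing tag was written as </TD>.
        "cell_text": cell.get_text(" ", strip=True),
        # Beautiful Soup returns the class attribute as a list of names.
        "span_class": span["class"],
        "span_text": span.get_text(strip=True),
        # Names of the unclosed single tags, in document order.
        "void_tags": [tag.name for tag in soup.find_all(["br", "hr"])],
    }


if __name__ == "__main__":
    print(extract(SLOPPY_HTML))
```

Other backends such as `lxml` or `html5lib` can be passed in place of `"html.parser"` if they are installed. They may repair broken markup somewhat differently, so the resulting tree is not guaranteed to be identical across backends.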
