lxml truncates text that contains 'less than' character

Problem description

>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?

Recommended answer

Your HTML input is broken; that < left angle bracket should have been encoded to &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.
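If you control how the markup is generated, the problem can be avoided at the source by escaping text before it is embedded. The following is a minimal sketch using the standard library's `html.escape`; the `wrap_in_div` helper is a hypothetical name introduced only for illustration and is not part of lxml.

```python
from html import escape


def wrap_in_div(text):
    """Hypothetical helper: escape text content before embedding it in markup."""
    # escape() turns '<', '>' and '&' (and quotes by default) into entities.
    return "<div>{}</div>".format(escape(text))


markup = wrap_in_div(" < 20 ")
print(markup)  # <div> &lt; 20 </div>
```

With the `<` encoded as `&lt;`, the input is no longer broken HTML, so the parser does not have to fall back on its recovery behaviour.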

One thing you could try is to use a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may be able to give you a different version of that document that does give you what you want out of it. BeautifulSoup can use different parser backends, including lxml and html5lib, so it'll give you more flexibility.

The html5lib parser does give you the < character (converted to a &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
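If switching parsers is not an option, another possible workaround is to pre-escape stray `<` characters before handing the string to the parser. The sketch below is a heuristic, not something lxml or BeautifulSoup provides: it assumes that any `<` not followed by a letter, `/`, `!` or `?` is meant as literal text, and the `escape_stray_lt` name is hypothetical. It will misfire on input such as `a<b`, where the `<` is immediately followed by a letter.

```python
import re

# Assumption: a '<' that does not start a tag, closing tag, comment,
# doctype, or processing instruction is literal text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Hypothetical helper: replace stray '<' characters with '&lt;'."""
    return STRAY_LT.sub("&lt;", markup)


cleaned = escape_stray_lt("<div> < 20 </div>")
print(cleaned)  # <div> &lt; 20 </div>
```

The cleaned string can then be passed to the parser in place of the original input.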
