How to parse broken XML in Python?

Views: 26 | Published: 2021/10/1 19:54:12 | Tags: python, xml


Question

A server I can't influence sends very broken XML.

Specifically, a Unicode WHITE STAR (U+2606) gets encoded as UTF-8 (E2 98 86) and is then run through a Latin-1 to HTML entity table. What I get is &acirc; followed by the raw bytes 98 86 (9 bytes in total) in a file that is declared as utf-8 and has no DTD.
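To make the corruption concrete, here is a minimal sketch of how such output could arise. It assumes the server escapes every byte that has a named Latin-1 entity and passes all other bytes through untouched; the function name `mangle` is hypothetical, and the lookup table is Python's own `html.entities.codepoint2name`, not the server's.

```python
from html.entities import codepoint2name

def mangle(text: str) -> bytes:
    """Hypothetical reconstruction of the server-side bug."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        name = codepoint2name.get(byte)
        # Assumption: only high bytes with a Latin-1 entity name get escaped.
        if byte >= 0xA0 and name is not None:
            out += b"&" + name.encode("ascii") + b";"
        else:
            out.append(byte)
    return bytes(out)

broken = mangle("\u2606")  # WHITE STAR
print(broken)       # b'&acirc;\x98\x86'
print(len(broken))  # 9
```

The lead byte E2 becomes the seven-character entity, while 98 and 86 have no Latin-1 entity name and stay as raw bytes, which accounts for the 9 bytes.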

I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found out how to make lxml skip it silently. SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.

What else is there?

Recommended answer

Maybe something like:

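import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs
from lxml import etree  # or maybe 'html', if the input is still more broken
def repl_ent(m):
    # Map a Latin-1 entity such as &acirc; back to its single byte; keep XML's own.
    cp = name2codepoint.get(m.group(1).decode('ascii'), 0x100)
    if cp > 0xFF or m.group(1) in (b'amp', b'lt', b'gt', b'quot'):
        return m.group(0)
    return bytes([cp])

goodxml = re.sub(rb'&(\w+);', repl_ent, badxml)  # badxml: the raw bytes received
etree.fromstring(goodxml)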
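The entity-replacement idea in this answer can be checked end to end with the standard library alone. The sketch below assumes Python 3 and assumes the only damage is Latin-1 entity names standing in for single bytes. The helper name `undo_entities` and the sample document are made up for illustration, and `xml.etree.ElementTree` is swapped in for lxml so that nothing needs to be installed.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself defines; these must stay escaped.
XML_BUILTINS = {b"amp", b"lt", b"gt", b"quot", b"apos"}

def undo_entities(raw: bytes) -> bytes:
    """Turn Latin-1 entity names back into the single bytes they replaced."""
    def repl(m):
        name = m.group(1)
        cp = name2codepoint.get(name.decode("ascii"))
        if name in XML_BUILTINS or cp is None or cp > 0xFF:
            return m.group(0)  # leave anything unexpected untouched
        return bytes([cp])
    return re.sub(rb"&(\w+);", repl, raw)

# Hypothetical sample: a mangled WHITE STAR next to a legitimate &amp;.
bad = b'<?xml version="1.0" encoding="utf-8"?><r>&acirc;\x98\x86 &amp; more</r>'
root = ET.fromstring(undo_entities(bad))
print(root.text)  # ☆ & more
```

Working on bytes rather than decoded text matters here: the restored E2 byte has to rejoin the raw 98 86 bytes before any UTF-8 decoding happens.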
