Can CDATA sections be preserved by BeautifulSoup?

Question

I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.

The culprit XML file:

<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        !@#$%^&*()_+{}|:"<>?,./;'[]-=
    ]]></bar>
</foo>

And here's the Python script.

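from bs4 import BeautifulSoup

# Parse the file with BeautifulSoup's lxml-backed XML parser.
# The "with" block closes the file once parsing is done.
with open("cdata.xml", "r") as xmlfile:
    soup = BeautifulSoup(xmlfile, "xml")

print(soup)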

Here's the output. Note that the CDATA section markers are missing and the special characters have been escaped as entities.

<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
        !@#$%^&amp;*()_+{}|:"&lt;&gt;?,./;'[]-=
    </bar>
</foo>

I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?

Is there a way to tell BeautifulSoup to preserve CDATA sections?

Update: Yes, it's an lxml thing (see http://lxml.de/api.html#cdata). So the question becomes: is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?
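For reference, the lxml behaviour mentioned in the update can be sketched with lxml on its own, outside BeautifulSoup. This is a minimal illustration, assuming lxml is installed; the variable names (XML, parser, root, out) are made up for the example, and the XML is the sample file from the question embedded as bytes.

```python
from lxml import etree

# The sample document from the question, embedded as bytes.
XML = b"""<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        !@#$%^&*()_+{}|:"<>?,./;'[]-=
    ]]></bar>
</foo>
"""

# strip_cdata=False asks lxml to keep CDATA sections as CDATA
# instead of converting them to ordinary escaped text.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(XML, parser)

# The text content is readable as usual, without the markers.
bar_text = root.find("bar").text

# On serialization the CDATA markers are written back out.
out = etree.tostring(root, encoding="unicode")
print(out)
```

Note that assigning a plain string to an element's .text replaces the CDATA section with ordinary text; lxml provides etree.CDATA(...) for writing a new CDATA section.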

Recommended answer

In my case, if I use

soup = BeautifulSoup(xmlfile, "lxml-xml")

then the CDATA content is preserved and accessible.
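The Beautiful Soup documentation lists "xml" and "lxml-xml" as two names for the same lxml-based XML parser, so the CDATA text should be readable as a string either way, while the markers may still be missing when the document is written back. One possible workaround, sketched here under the assumption that you know which elements should hold CDATA, is to re-wrap their text in bs4's CData class before serializing. The shortened XML string and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, CData

# A shortened, hypothetical version of the question's document.
XML = "<foo><bar><![CDATA[ a < b & c ]]></bar></foo>"

soup = BeautifulSoup(XML, "xml")  # requires lxml to be installed
bar = soup.find("bar")

# After parsing, the content is a plain string without CDATA markers.
original_text = str(bar.string)

# Replace the plain string with a CData node so that the markers
# are emitted (and the content left unescaped) on output.
bar.string.replace_with(CData(original_text))

out = str(soup)
print(out)
```

This only controls how the tree is written out; it has to be applied to every element whose content should be emitted as CDATA, and it is not suitable for text that itself contains the sequence ]]>.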
