How Do I Remove An XML Declaration Using BeautifulSoup4


Question

I have an XHTML file that is structured like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I'm using BeautifulSoup and I want to remove the XML declaration from the document, so that what I end up with looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString as far as I can tell. Is there a way I can find this to extract it?

As a working example, I can remove the Doctype with code like this (assuming the document text is the variable "html"):

from bs4 import BeautifulSoup, Doctype

soup = BeautifulSoup(html)
[item.extract() for item in soup.contents if isinstance(item, Doctype)]

Recommended Answer

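You can use the following approach. With the 'html.parser' backend, the XML declaration is parsed as a ProcessingInstruction node, so you can test for that type and extract it: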

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

for e in soup:
    if isinstance(e, bs4.element.ProcessingInstruction):
        e.extract()
        break

print(soup)

For your sample, this would give you the updated HTML as:

<!DOCTYPE html>

<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html></html></html>
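
The same idea can be wrapped in a small reusable helper. The following is only a minimal sketch, assuming the 'html.parser' backend is used as in the answer above; the function name strip_xml_declaration and the sample string are made up for illustration, and other parsers may represent the declaration differently.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction


def strip_xml_declaration(markup: str) -> str:
    """Return the markup with top-level processing instructions removed.

    Hypothetical helper: relies on html.parser exposing '<?xml ...?>'
    as a ProcessingInstruction node, as in the answer above.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Iterate over a copy so that extract() does not disturb the loop.
    for node in list(soup.contents):
        if isinstance(node, ProcessingInstruction):
            node.extract()
    return str(soup)


# Illustrative input, not the asker's actual file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE html>\n"
    '<html lang="en"><head></head><body><p>hi</p></body></html>'
)

cleaned = strip_xml_declaration(sample)
print(cleaned)
```

Unlike the loop in the answer, this version does not stop at the first match, so it also drops any further top-level processing instructions. If only the XML declaration should go, keep the break from the original code.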
