How to tell BeautifulSoup to extract the content of a specific tag as text? (without touching it)

Problem description

I need to parse an HTML document that contains "code" tags.

I get the code blocks like this:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is, if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup forces the closing of what it takes to be nested tags and transforms the code block into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what it thinks are HTML markup errors?

Recommended answer

Add the code tag to the QUOTE_TAGS dictionary. BeautifulSoup 3 treats the contents of the tags listed there as literal text instead of parsing them as markup.

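# BeautifulSoup 3 import style
from BeautifulSoup import BeautifulSoup

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

# Register 'code' as a quote tag so that its contents are kept as literal text
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)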

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
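
The answer above relies on the BeautifulSoup 3 import (from BeautifulSoup import BeautifulSoup). Where that version is not available, one possible workaround is to pull the raw text out of the code tags before any HTML parser sees it. The following is a minimal sketch using only Python's standard re module; it is not part of the original answer, and the helper name extract_code_blocks is hypothetical. It assumes code tags are not nested and that attribute values contain no ">" character.

```python
import re

# Hypothetical helper, not a BeautifulSoup API: matches <code ...>...</code>
# and captures everything between the opening and closing tag verbatim.
CODE_RE = re.compile(r"<code\b[^>]*>(.*?)</code\s*>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the untouched inner text of every <code> element in html."""
    return [inner.strip() for inner in CODE_RE.findall(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
blocks = extract_code_blocks(content)
print(blocks)
```

Because the text is never handed to an HTML parser, List<Person> is not mistaken for a tag. The trade-off is that a regular expression does not understand HTML structure, so this only suits documents where the code tags are simple and well delimited.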
