Reading <content:encoded> tags using BeautifulSoup 4


Question

I'm using BeautifulSoup 4 (bs4) to read an XML RSS feed and have come across the following entry. I'm trying to read the content enclosed in the <content:encoded><![CDATA[...]]></content:encoded> tag:

<item>
    <title>Foobartitle</title>
    <link>http://www.acme.com/blah/blah.html</link>
    <category><![CDATA[mycategory]]></category>
    <description><![CDATA[The quick brown fox jumps over the lazy dog]]></description>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>

As I understand it, this format is part of the RSS content module and is pretty common.

I'd like to isolate the <content:encoded> tag and then read the CDATA contents. For the avoidance of doubt, the result should be <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>.

I've looked at three related Stack Overflow posts, but I haven't been able to work out how to get the job done, since none of them is directly related to my case.

I am using the lxml XML parser with bs4.

Any suggestions? Thanks!

Recommended answer

from bs4 import BeautifulSoup

doc = ...
soup = BeautifulSoup(doc, "xml")  # Directs bs to use lxml

Interestingly, BeautifulSoup/lxml renames the tags, most noticeably turning content:encoded into plain encoded (presumably because the content prefix is not declared in this fragment).

>>> print(soup)
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Foobartitle</title>
<link>http://www.acme.com/blah/blah.html</link>
<category>mycategory</category>
<description>The quick brown fox jumps over the lazy dog</description>
<encoded>
        &lt;p&gt;&lt;img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /&gt;&lt;/p&gt;
    </encoded>
</item>

From there, it should be enough to iterate over the children.

# find_all is the bs4 spelling of the older findAll
for encoded_content in soup.find_all("encoded"):
    for child in encoded_content.children:
        print(child)

That prints <p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>. Note that this appears to be an instance of bs4.element.NavigableString, not CData as in your linked answers.
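Putting the steps above together, here is a minimal sketch, assuming bs4 and lxml are installed. The helper name extract_encoded and the SAMPLE_ITEM constant are hypothetical, introduced only for illustration, and the sample is a shortened copy of the item from the question. Because the tag name may or may not keep its prefix depending on how the feed declares its namespaces, the sketch searches for both spellings.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <item> from the question (illustrative data only).
SAMPLE_ITEM = """<item>
    <title>Foobartitle</title>
    <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" alt="" /></p>]]>
    </content:encoded>
</item>"""


def extract_encoded(xml_text):
    """Return the stripped text of every encoded element (hypothetical helper)."""
    soup = BeautifulSoup(xml_text, "xml")  # "xml" selects lxml's XML parser
    # Accept both spellings, in case the prefix survives parsing.
    tags = soup.find_all(["encoded", "content:encoded"])
    # strip=True drops the whitespace surrounding the CDATA section.
    return [tag.get_text(strip=True) for tag in tags]


if __name__ == "__main__":
    for html in extract_encoded(SAMPLE_ITEM):
        print(html)
```

Each returned string is itself HTML markup, so it can be handed to a second BeautifulSoup(html, "html.parser") call if the embedded img tag needs to be inspected.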
