Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?


Problem Description


YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items on HackerNews. I am trying to write a python script to access the RSS feed document and then parse out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each of the items.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle item, link, is somehow being omitted. That is, the resulting value of link is somehow an empty string. So why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something screwy is happening with just the link tag: the parse keeps only its closing tag, and the URL that belonged inside it is dumped after it as bare text. This is some very strange behavior, especially in contrast to title and comments, which are parsed without a problem.

This seems to be a problem with BeautifulSoup, because the raw text that requests reads in looks fine. I don't think it is limited to BeautifulSoup, though, because I tried the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this would be happening or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
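For reference, a minimal sketch of the xml.dom.minidom approach mentioned above. The feed text here is a trimmed, hypothetical stand-in for the live HackerNews feed, not real data:

```python
from xml.dom import minidom

# Trimmed stand-in for the feed (hypothetical sample data).
rss = """<rss version="2.0"><channel>
<item>
  <title>Example item</title>
  <link>https://example.com/story</link>
  <comments>https://example.com/item?id=1</comments>
</item>
</channel></rss>"""

dom = minidom.parseString(rss)
for item in dom.getElementsByTagName('item'):
    # firstChild is the text node inside each element
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)
```

Because minidom is a real XML parser, the link element keeps its text and nothing is mangled.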

Update: I am on a Mac with OSX 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml and it was installed with pip. It is version 3.0.2. As far as libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.

Solution

Wow, great question. This strikes me as a bug in BeautifulSoup. The reason that you can't access the link using soup.find('item').link is that when you first load the html into BeautifulSoup to begin with, it does something odd to the HTML:

>>> from bs4 import BeautifulSoup as BS
>>> BS(html)
<html><body><rss version="2.0">
<channel>
<title>Hacker News</title><link/>http://news.ycombinator.com/<description>Links for the intellectually curious, ranked by readers.</description>
<item>
<title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and 'Notch'</title>
<link/>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
<description>Comments]]&gt;</description>
</item>
<item>
<title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
<link/>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
<description>Comments]]&gt;</description>
</item>
...
</channel>
</rss></body></html>

Look carefully--it has actually changed the first <link> tag to <link/> and then removed the </link> tag. I'm not sure why it would do this, but without fixing the problem in the BeautifulSoup.BeautifulSoup class initialization, you're not going to be able to use it for now.
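One likely explanation for the mangling: in HTML, <link> is a void (self-closing) element, so BeautifulSoup's HTML tree builders close it immediately and push its text content outside the tag. Asking BeautifulSoup for an XML parse instead (the 'xml' feature, which requires lxml, which you already have installed) sidesteps this entirely. A minimal sketch against a trimmed, hypothetical stand-in feed:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the feed (hypothetical sample data).
rss = """<rss version="2.0"><channel>
<item>
  <title>Example item</title>
  <link>https://example.com/story</link>
</item>
</channel></rss>"""

# HTML parse: <link> is treated as a void element, so its text is lost.
html_soup = BeautifulSoup(rss, 'html.parser')
print(repr(html_soup.find('item').find('link').text))   # empty string

# XML parse (requires lxml): <link> keeps its contents.
xml_soup = BeautifulSoup(rss, 'xml')
print(repr(xml_soup.find('item').find('link').text))    # the URL
```

With the XML parser, item.find('link').text returns the URL just as the asker's original loop expects.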

Update:

I think your best (albeit hack-y) bet for now is to use the following for link:

>>> soup.find('item').link.next_sibling
u'http://news.ycombinator.com/'
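Applied to the original loop, the workaround would look roughly like this (a sketch against a trimmed, hypothetical stand-in feed; the .strip() guards against surrounding whitespace in the orphaned text node):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the feed (hypothetical sample data).
rss = """<rss version="2.0"><channel>
<item>
  <title>Example item</title>
  <link>https://example.com/story</link>
  <comments>https://example.com/item?id=1</comments>
</item>
</channel></rss>"""

soup = BeautifulSoup(rss, 'html.parser')
for item in soup.find_all('item'):
    title = item.find('title').text
    # <link> was self-closed by the HTML parser, so its URL
    # now lives in the sibling text node right after the tag.
    link = item.find('link').next_sibling.strip()
    comments = item.find('comments').text
    print(title + ' - ' + link + ' - ' + comments)
```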
