Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?

Problem Description

YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items on HackerNews. I am trying to write a Python script to access the RSS feed document and then parse out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each of the items.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle element, link, is somehow being omitted; its resulting value is an empty string. So why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something screwy is happening with just the link tag. It gets only the close tag, and the text for that tag then appears after it. This is very strange behavior, especially in contrast to title and comments, which are parsed without a problem.
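
Digging further, the behavior can be reproduced in isolation: HTML parsers treat <link> as a void (self-closing) element, so any text after it falls outside the tag. A minimal sketch of this (the markup below is made up purely for illustration):

from bs4 import BeautifulSoup

# To an HTML parser, <link> is a void element: it is closed immediately
# and the stray </link> end tag is discarded.
snippet = '<item><link>http://example.com/page</link></item>'
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.find('link'))       # prints an empty <link/> tag
print(soup.find('item').text)  # the URL text ended up outside <link>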

This seems to be a problem with BeautifulSoup, because what is actually read in by requests doesn't have any problems. I don't think it is limited to BeautifulSoup though, because I tried the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this would be happening or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
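
For reference, a minimal sketch of that xml.dom.minidom approach (my own reconstruction, not the exact script, and it assumes every item has all three child elements):

import requests
import xml.dom.minidom

request = requests.get('http://news.ycombinator.com/rss')
# minidom is a strict XML parser, so <link> is an ordinary element
# here rather than an HTML void tag.
dom = xml.dom.minidom.parseString(request.content)
for item in dom.getElementsByTagName('item'):
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)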

Update: I am on a Mac with OSX 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml and it was installed with pip. It is version 3.0.2. As for libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.
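
Incidentally, the lxml and libxml2 versions actually in use can be checked from Python itself rather than by poking around /usr/lib (a quick sketch):

import lxml.etree

print(lxml.etree.LXML_VERSION)    # version tuple of the lxml bindings
print(lxml.etree.LIBXML_VERSION)  # version tuple of the underlying libxml2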

Recommended Answer

Actually, the problem seems to be related to the parser you are using. By default, an HTML parser is used. Try soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.

It will then use an XML parser instead of an HTML one, and everything should be OK.
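
Putting that together, here is a corrected version of the script (this assumes lxml is installed, since BeautifulSoup's 'xml' mode relies on it):

import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
# 'xml' selects lxml's XML parser, which does not apply HTML rules
# such as treating <link> as a self-closing tag.
soup = BeautifulSoup(request.text, 'xml')
for item in soup.find_all('item'):
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print(title + ' - ' + link + ' - ' + comments)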

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more information.
