Why is BeautifulSoup unable to correctly read/parse this RSS (XML) document?

Problem Description

YCombinator is nice enough to provide an RSS feed and a big RSS feed containing the top items on HackerNews. I am trying to write a Python script to access the RSS feed document and then parse out certain pieces of information using BeautifulSoup. However, I am getting some strange behavior when BeautifulSoup tries to get the content of each of the items.

Here are a few sample lines of the RSS feed:

<rss version="2.0">
<channel>
<title>Hacker News</title><link>http://news.ycombinator.com/</link><description>Links for the intellectually curious, ranked by readers.</description>
<item>
    <title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title>
    <link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch</link>
    <comments>http://news.ycombinator.com/item?id=4944322</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4944322">Comments</a>]]></description>
</item>
<item>
    <title>Two Billion Pixel Photo of Mount Everest (can you find the climbers?)</title>
    <link>https://s3.amazonaws.com/Gigapans/EBC_Pumori_050112_8bit_FLAT/EBC_Pumori_050112_8bit_FLAT.html</link>
    <comments>http://news.ycombinator.com/item?id=4943361</comments>
    <description><![CDATA[<a href="http://news.ycombinator.com/item?id=4943361">Comments</a>]]></description>
</item>
...
</channel>
</rss>

Here is the code I have written (in Python) to access this feed and print out the title, link, and comments for each item:

import sys
import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
soup = BeautifulSoup(request.text)
items = soup.find_all('item')
for item in items:
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print title + ' - ' + link + ' - ' + comments

However, this script is giving output that looks like this:

EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39; -  - http://news.ycombinator.com/item?id=4944322
Two Billion Pixel Photo of Mount Everest (can you find the climbers?) -  - http://news.ycombinator.com/item?id=4943361
...

As you can see, the middle element, link, is somehow being omitted; its resulting value is an empty string. So why is that?

As I dig into what is in soup, I realize that it is somehow choking when it parses the XML. This can be seen by looking at the first item in items:

>>> print items[0]
<item><title>EFF Patent Project Gets Half-Million-Dollar Boost from Mark Cuban and &#39;Notch&#39;</title></link>https://www.eff.org/press/releases/eff-patent-project-gets-half-million-dollar-boost-mark-cuban-and-notch<comments>http://news.ycombinator.com/item?id=4944322</comments><description>...</description></item>

You'll notice that something screwy is happening with just the link tag. It gets only the close tag, and the text for that tag then appears after it. This is very strange behavior, especially in contrast to title and comments, which are parsed without a problem.
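
Digging further, the behavior can be reproduced in isolation: HTML parsers treat <link> as a void (self-closing) element, so any text after it falls outside the tag. A minimal sketch of this (the markup below is made up purely for illustration):

from bs4 import BeautifulSoup

# To an HTML parser, <link> is a void element: it is closed immediately
# and the stray </link> end tag is discarded.
snippet = '<item><link>http://example.com/page</link></item>'
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.find('link'))       # prints an empty <link/> tag
print(soup.find('item').text)  # the URL text ended up outside <link>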

This seems to be a problem with BeautifulSoup, because what is actually read in by requests doesn't have any problems. I don't think it is limited to BeautifulSoup though, because I tried the xml.etree.ElementTree API as well and the same problem arose (is BeautifulSoup built on this API?).

Does anyone know why this would be happening or how I can still use BeautifulSoup without getting this error?

Note: I was finally able to get what I wanted with xml.dom.minidom, but this doesn't seem like a highly recommended library. I would like to continue using BeautifulSoup if possible.
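
For reference, a minimal sketch of that xml.dom.minidom approach (my own reconstruction, not the exact script, and it assumes every item has all three child elements):

import requests
import xml.dom.minidom

request = requests.get('http://news.ycombinator.com/rss')
# minidom is a strict XML parser, so <link> is an ordinary element
# here rather than an HTML void tag.
dom = xml.dom.minidom.parseString(request.content)
for item in dom.getElementsByTagName('item'):
    title = item.getElementsByTagName('title')[0].firstChild.data
    link = item.getElementsByTagName('link')[0].firstChild.data
    comments = item.getElementsByTagName('comments')[0].firstChild.data
    print(title + ' - ' + link + ' - ' + comments)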

Update: I am on a Mac with OSX 10.8 using Python 2.7.2 and BS4 4.1.3.

Update 2: I have lxml and it was installed with pip. It is version 3.0.2. As for libxml, I checked in /usr/lib and the one that shows up is libxml2.2.dylib. Not sure when or how that was installed.
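
Incidentally, the lxml and libxml2 versions actually in use can be checked from Python itself rather than by poking around /usr/lib (a quick sketch):

import lxml.etree

print(lxml.etree.LXML_VERSION)    # version tuple of the lxml bindings
print(lxml.etree.LIBXML_VERSION)  # version tuple of the underlying libxml2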

Recommended Answer

Actually, the problem seems to be related to the parser you are using. By default, an HTML parser is used. Try soup = BeautifulSoup(request.text, 'xml') after installing the lxml module.

It will then use an XML parser instead of an HTML one, and everything should be OK.
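
Putting that together, here is a corrected version of the script (this assumes lxml is installed, since BeautifulSoup's 'xml' mode relies on it):

import requests
from bs4 import BeautifulSoup

request = requests.get('http://news.ycombinator.com/rss')
# 'xml' selects lxml's XML parser, which does not apply HTML rules
# such as treating <link> as a self-closing tag.
soup = BeautifulSoup(request.text, 'xml')
for item in soup.find_all('item'):
    title = item.find('title').text
    link = item.find('link').text
    comments = item.find('comments').text
    print(title + ' - ' + link + ' - ' + comments)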

See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for more information.
