Beautiful Soup throws `IndexError`

Question

I am scraping a website using Python 2.7 and Beautiful Soup 3.2. I am new to both, but the documentation got me started.

I read the following documentation:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#contents
http://thepcspy.com/read/scraping-websites-with-python/

What I have now (the part that fails):

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

# From the source HTML page, find and store all <td class="home">..</td> tags and their contents
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tags and keep only the needed information: the home team name
hometeams = [tag.contents[1] for tag in hometeamsTd]

# From the source HTML page, find and store all <td class="away">..</td> tags and their contents
awayteamsTd = soup.findAll('td', { "class" : "away" })
# Loop through the tags and keep only the needed information: the away team name
awayteams = [tag.contents[1] for tag in awayteamsTd]

The tag.contents values for the tags in hometeamsTd look like this:

[
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Harkemase Boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'RKC Waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'PSV', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'SC Heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
]

The tag.contents values for the tags in awayteamsTd look like this:

[
    [u'Away-team'], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'NEC', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'Heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'Stormvogels Telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'FC Volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'FC Twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'FC Dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
]

The problems I am trying to solve, but don't quite understand yet, are:


  • The code awayteams = [tag.contents[1] for tag in awayteamsTd] throws an error: IndexError: list index out of range. That is of course correct because, as the output of tag.contents for awayteamsTd shows, the first entry is [u'Away-team'], which has only one element. But how can I remove or skip this entry?
  • In the hometeams output everything works, but I would like to exclude the entries where the text Dutch KNVB Beker occurs.

Recommended answer

The problem is that the "Away-team" header cell (the column name) is itself a td with the "away" class, so findAll returns it along with the team cells:

<thead class="title">
    ...
    <tr class="sub">
      ...  
      <td>Home-team</td>
      <td></td>
      <td class="away">Away-team</td>
      <td class="broadcast">Broadcast</td>
    </tr>
</thead>
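To see why the header cell ends up in the results, here is a minimal sketch that collects every td with class "away" from a simplified table. It uses Python 3's standard `html.parser` rather than Beautiful Soup, only so that it runs without extra packages. The class name `AwayCellCollector` and the markup are hypothetical stand-ins for the real page.

```python
from html.parser import HTMLParser  # Python 3 standard library


class AwayCellCollector(HTMLParser):
    """Collect the text of every <td> whose class attribute is exactly "away"."""

    def __init__(self):
        super().__init__()
        self.in_away_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td' and dict(attrs).get('class') == 'away':
            self.in_away_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_away_cell = False

    def handle_data(self, data):
        if self.in_away_cell:
            self.cells[-1] += data


# Hypothetical, simplified markup: one header row and one match row.
html = """
<table>
  <thead><tr class="sub"><td>Home-team</td><td class="away">Away-team</td></tr></thead>
  <tbody><tr><td class="home">PSV</td><td class="away">NEC</td></tr></tbody>
</table>
"""

collector = AwayCellCollector()
collector.feed(html)
away_cells = collector.cells
print(away_cells)
```

The header text appears first in the collected list because the header row comes first in the document, which is the same order findAll would return it in.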

Just use slicing to skip that header cell:

awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
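Slicing assumes the header cell is always the first match. As a rough sketch of a more defensive alternative, the snippet below skips any cell whose contents are too short to hold a team name. Plain Python lists stand in for `tag.contents` here (an assumption, so the example runs without the website), and the helper name `second_items` is hypothetical.

```python
# Stand-in data: each inner list mimics tag.contents for one <td class="away">.
# Plain strings replace the <img> tags purely for illustration.
away_contents = [
    ['Away-team'],                            # header cell: only one child
    ['<img fav>', 'NEC', '<img flag>'],
    ['<img fav>', 'Heracles', '<img flag>'],
]


def second_items(contents_list):
    """Return element [1] of every list that has at least two elements."""
    return [contents[1] for contents in contents_list if len(contents) > 1]


# Option 1: slice off the first (header) entry, as in the answer.
sliced = [contents[1] for contents in away_contents[1:]]

# Option 2: guard on length, which also tolerates short cells elsewhere.
guarded = second_items(away_contents)

print(sliced)
print(guarded)
```

With real Beautiful Soup tags, the same guard would be written as `if len(tag.contents) > 1` inside the original list comprehension.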

Also, if you want to exclude Dutch KNVB Beker from the list of home teams, add a condition to the list comprehension:

hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
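If more names need to be excluded later, the condition can be driven by a set instead of a single string. The following is only a sketch under the same assumption as before: plain lists stand in for `tag.contents`. The names `EXCLUDED` and `team_names` are hypothetical, and stripping whitespace is an extra precaution that the original answer does not include.

```python
# Names to leave out of the result; extend this set as needed.
EXCLUDED = {'Dutch KNVB Beker'}

# Stand-in data mimicking tag.contents for each <td class="home">.
home_contents = [
    ['<img flag>', 'Harkemase Boys', '<img fav>'],
    ['<img flag>', 'Dutch KNVB Beker', '<img fav>'],
    ['<img flag>', ' PSV ', '<img fav>'],
]


def team_names(contents_list, excluded=EXCLUDED):
    """Extract element [1] of each entry, skipping short entries and excluded names."""
    names = []
    for contents in contents_list:
        if len(contents) < 2:
            continue  # no team name present, e.g. a header cell
        name = contents[1].strip()
        if name not in excluded:
            names.append(name)
    return names


hometeams = team_names(home_contents)
print(hometeams)
```

Looking up the name once and storing it in a variable also avoids indexing `contents[1]` twice, as the one-line comprehension does.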
