How to get tbody from a table with Python Beautiful Soup?


Problem description

I'm trying to scrape the Year and Winners (first and second columns) from the "List of finals matches" table (the second table) at http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals. I'm using the code below:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
soup.findAll('table')[0].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

With the above code, I was able to get the first and third columns just fine. But when I use the same code with http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals, it cannot find tbody as an element of the table, even though I can see the tbody when I inspect the element.

url = "http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
soup = BeautifulSoup(urllib2.urlopen(url).read())

print soup.findAll('table')[2]

soup.findAll('table')[2].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

Here's the error I got:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-150-fedd08c6da16> in <module>()
      7 # print soup.findAll('table')[2]
      8 
----> 9 soup.findAll('table')[2].tbody.findAll('tr')
     10 for row in soup.findAll('table')[0].tbody.findAll('tr'):
     11     first_column = row.findAll('th')[0].contents

AttributeError: 'NoneType' object has no attribute 'findAll'

Solution

If you inspect the page with the browser's inspect tool, keep in mind that the browser inserts the tbody tags itself.

The page source may or may not contain them. I suggest looking at the source view if you really want to know.
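
One way to check the raw source programmatically, rather than by eye, is sketched below. It uses only Python 3's standard-library html.parser, which reports the tags that appear in the text and adds none of its own. The names TagCollector and source_has_tbody, and the two sample strings, are hypothetical and hand-written for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag seen in the raw markup."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.tags.add(tag)


def source_has_tbody(html):
    """Return True if the raw HTML text itself contains a <tbody> tag."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return "tbody" in collector.tags


# Hand-written sample markup (hypothetical, not fetched from any site).
WITHOUT_TBODY = "<table><tr><th>1930</th><td>Uruguay</td></tr></table>"
WITH_TBODY = "<table><tbody><tr><th>1930</th><td>Uruguay</td></tr></tbody></table>"

if __name__ == "__main__":
    print(source_has_tbody(WITHOUT_TBODY))
    print(source_has_tbody(WITH_TBODY))
```

To apply this to a real page, pass in the text you downloaded; the result only describes the markup as served, not what a browser later builds from it.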

Either way, you do not need to traverse to the tbody; simply call findAll('tr') on the table itself:

soup.findAll('table')[0].findAll('tr') should work.
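
A minimal sketch of that approach on a small hand-written table follows. It assumes Python 3 and the bs4 package (BeautifulSoup 4) with the built-in "html.parser" backend, rather than the BeautifulSoup 3 import used in the question, so findAll is spelled find_all. The sample markup and the extract_rows helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup with no <tbody> in the source (hypothetical).
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Winners</th></tr>
  <tr><th>1930</th><td>Uruguay</td></tr>
  <tr><th>1934</th><td>Italy</td></tr>
</table>
"""


def extract_rows(html, table_index=0):
    """Return the text of every cell, row by row, without touching tbody."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    rows = []
    # find_all searches all descendants, so the rows are found whether or
    # not a <tbody> sits between the table and its <tr> elements.
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

With this parser, table.tbody is None for markup whose source has no tbody, which is what leads to the AttributeError in the question; searching the table directly sidesteps that.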
