How to get tbody from a table with Python Beautiful Soup?
I'm trying to scrape Year and Winners (the first and second columns) from the "List of finals matches" table (the second table) at http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals. I'm using the code below:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
soup.findAll('table')[0].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column
With the above code, I was able to get the first and third columns just fine. But when I use the same code with http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals, it cannot find tbody as an element of the table, even though I can see the tbody when I inspect the element in the browser.
url = "http://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals"
soup = BeautifulSoup(urllib2.urlopen(url).read())
print soup.findAll('table')[2]
soup.findAll('table')[2].tbody.findAll('tr')
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column
Here is the error I get:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-150-fedd08c6da16> in <module>()
      7 # print soup.findAll('table')[2]
      8
----> 9 soup.findAll('table')[2].tbody.findAll('tr')
     10 for row in soup.findAll('table')[0].tbody.findAll('tr'):
     11     first_column = row.findAll('th')[0].contents

AttributeError: 'NoneType' object has no attribute 'findAll'
If you inspect the page with the browser's inspect tool, the browser inserts the tbody tags into the table for you.
The page source may or may not contain them, so I suggest checking the source view if you really want to know. When the source has no tbody, .tbody returns None, which is why calling .findAll on it raises the AttributeError.
Either way, you do not need to traverse to the tbody. Simply:
soup.findAll('table')[0].findAll('tr')
should work.
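To illustrate the point that rows can be collected without going through tbody, here is a minimal sketch. It swaps in the standard library's HTMLParser for BeautifulSoup so that it runs without third-party packages; the class RowCollector, the function parse_rows, and the SAMPLE markup are hypothetical names made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row.

    <tbody> is only noted, never required, so the result is the same
    whether or not the raw source contains it.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self.saw_tbody = False  # True if the raw source has a <tbody> tag
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.saw_tbody = True
        elif tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return (rows, saw_tbody) for the given HTML string."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows, parser.saw_tbody


# Hypothetical sample markup: a table written without <tbody>,
# as the raw page source may be.
SAMPLE = """
<table>
  <tr><th>1930</th><td>Uruguay</td><td>4-2</td></tr>
  <tr><th>1934</th><td>Italy</td><td>2-1</td></tr>
</table>
"""

rows, saw_tbody = parse_rows(SAMPLE)
for row in rows:
    print(row[0], row[1])
```

The saw_tbody flag plays the role of "looking at the source view": it reports what the raw markup contains, not what a browser would add. With the current bs4 package, the equivalent of the one-liner above is table.find_all('tr') called directly on the table element.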