Python BeautifulSoup: parsing multiple tables with the same class name

Problem description

I am trying to parse some tables from a Wikipedia page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014. There are four tables with the same class name, "wikitable". When I write:

movieList = soup.find('table', {'class': 'wikitable'})
rows = movieList.findAll('tr')

It works fine, but when I write:

movieList = soup.findAll('table', {'class': 'wikitable'})
rows = movieList.findAll('tr')

It throws an error:

Traceback (most recent call last):
  File "C:\Python27\movieList.py", line 24, in <module>
    rows = movieList.findAll('tr')
AttributeError: 'ResultSet' object has no attribute 'findAll'

When I print movieList, it prints all four tables.

Also, how can I parse the content reliably when the number of columns in a row varies? I want to store this information in separate variables.

Solution

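findAll() (spelled find_all() in current bs4 releases) returns a ResultSet object, which is basically a list of elements. A list has no findAll() method of its own, hence the AttributeError. To find elements inside each element of the ResultSet, use a loop: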

movie_list = soup.findAll('table', {'class': 'wikitable'})
for movie in movie_list:
    rows = movie.findAll('tr')
    ...
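
To see the loop work end to end without fetching the Wikipedia page, here is a minimal sketch that runs against a small inline snippet. The SAMPLE_HTML markup and the movie titles are made up for illustration; the sketch uses the find_all() spelling and Python's built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two tables sharing one class.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th></tr>
  <tr><td>Movie A</td></tr>
</table>
<table class="wikitable">
  <tr><th>Title</th></tr>
  <tr><td>Movie B</td></tr>
  <tr><td>Movie C</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns a ResultSet (a list subclass), not a single Tag,
# so tag methods must be called on each element rather than on the result.
tables = soup.find_all("table", {"class": "wikitable"})

rows_per_table = []
for table in tables:
    rows = table.find_all("tr")  # valid: `table` is a Tag
    rows_per_table.append(len(rows))

print(rows_per_table)  # [2, 3]
```

Collecting the rows inside the loop keeps each table's rows separate, which is what the question needs in order to treat the four tables individually.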

You could also use a CSS selector, but in this case it returns the rows of all tables as one flat list, so it is not easy to tell which table each row came from:

rows = soup.select('table.wikitable tr')
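
If the flat list is still convenient, one possible way to recover the grouping is to look up each row's enclosing table with find_parent(). The sketch below is an illustration on made-up markup (SAMPLE_HTML and the movie titles are assumptions), not something taken from the Wikipedia page itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables with the same class.
SAMPLE_HTML = """
<table class="wikitable"><tr><td>Movie A</td></tr></table>
<table class="wikitable"><tr><td>Movie B</td></tr><tr><td>Movie C</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One flat list containing the rows of every matching table.
rows = soup.select("table.wikitable tr")

# Recover the grouping: for each row, find which table object encloses it.
# Identity (`is`) is used because Tag equality compares content, not position.
tables = soup.select("table.wikitable")
table_index_of_row = [
    next(i for i, t in enumerate(tables) if t is row.find_parent("table"))
    for row in rows
]

print(len(rows), table_index_of_row)  # 3 [0, 1, 1]
```

This works, but it does more bookkeeping than the plain loop over tables, which is why the loop is usually the simpler choice here.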


As a bonus, here is how you can collect all of the "Releases" into a dictionary where the keys are the periods and the values are lists of movies:

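from pprint import pprint
import urllib2  # Python 2 only; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014'
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')

headers = ['Opening', 'Title', 'Genre', 'Director', 'Cast']
results = {}
# Each h3 heading directly under the main content div introduces one release period
for block in soup.select('div#mw-content-text > h3'):
    title = block.find('span', class_='mw-headline').text
    # The next wikitable sibling after the heading holds that period's movies
    rows = block.find_next_sibling('table', class_='wikitable').find_all('tr')

    # Skip the header row; zip() stops at the shorter sequence, so a row with
    # fewer cells than headers yields a dict with fewer keys
    results[title] = [{header: td.text for header, td in zip(headers, row.find_all('td'))}
                      for row in rows[1:]]

pprint(results)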
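
The question also asks about rows with a varying number of columns. The snippet above simply lets zip() truncate, which pairs the wrong header with each cell whenever a row is short. A rough sketch of one alternative is to pad short rows so that every dictionary has the same keys. It assumes that a short row is missing its leading cells (as happens when an earlier row's first cell spans several rows); that assumption, the row_to_dict helper, and the sample markup are all hypothetical and should be checked against the real page.

```python
from bs4 import BeautifulSoup

HEADERS = ["Opening", "Title", "Genre"]

# Hypothetical table: the second data row has one cell fewer than the first.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Opening</th><th>Title</th><th>Genre</th></tr>
  <tr><td>JAN 3</td><td>Movie A</td><td>Drama</td></tr>
  <tr><td>Movie B</td><td>Comedy</td></tr>
</table>
"""


def row_to_dict(row, headers):
    """Map a <tr> to a dict, left-padding short rows with None.

    ASSUMPTION: missing cells are the leading ones. If the real table
    drops trailing cells instead, pad on the right.
    """
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    padded = [None] * (len(headers) - len(cells)) + cells
    return dict(zip(headers, padded))


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")

# Skip the header row, then convert each remaining row.
movies = [row_to_dict(row, HEADERS) for row in table.find_all("tr")[1:]]

for movie in movies:
    print(movie)
```

Here the second row comes out with Opening set to None and the title and genre under the right keys; carrying the previous row's Opening value forward would be a natural next step if the date is needed for every movie.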

This should get you much closer to solving the problem.
