Python BeautifulSoup: parsing multiple tables with same class name

Views: 1233 Published: 2018/6/26 11:48:52 Tags: python html python-2.7 beautifulsoup html-parsing


Problem description

I am trying to parse some tables from a wiki page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014.
There are four tables with the same class name "wikitable". When I write:

  movieList = soup.find('table', {'class': 'wikitable'})
  rows = movieList.findAll('tr')

it works fine, but when I write:

  movieList = soup.findAll('table', {'class': 'wikitable'})
  rows = movieList.findAll('tr')

it throws an error:

  Traceback (most recent call last):
    File "C:\Python27\movieList.py", line 24, in <module>
      rows = movieList.findAll('tr')
  AttributeError: 'ResultSet' object has no attribute 'findAll'

When I print movieList, it prints all four tables.

Also, how can I parse the content effectively, given that the number of columns in a row is variable? I want to store this information in different variables.

Solution

findAll() returns a ResultSet object, which is basically a list of elements. To find elements inside each element of the ResultSet, use a loop:

  movie_list = soup.findAll('table', {'class': 'wikitable'})
  for movie in movie_list:
      rows = movie.findAll('tr')
      ...

You could also use a CSS selector, but in this case it would not be easy to distinguish rows between tables:

  rows = soup.select('table.wikitable tr')

As a bonus, here is how you can collect all of the "Releases" into a dictionary where the keys are the periods and the values are lists of movies:

  from pprint import pprint
  import urllib2  # Python 2 only; on Python 3 use urllib.request instead
  from bs4 import BeautifulSoup

  url = 'http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014'
  soup = BeautifulSoup(urllib2.urlopen(url))

  headers = ['Opening', 'Title', 'Genre', 'Director', 'Cast']
  results = {}
  # Each h3 heading names a period; the table that follows lists its movies.
  for block in soup.select('div#mw-content-text > h3'):
      title = block.find('span', class_='mw-headline').text
      rows = block.find_next_sibling('table', class_='wikitable').find_all('tr')
      # Skip the header row and map each cell to its column name.
      results[title] = [{header: td.text for header, td in zip(headers, row.find_all('td'))}
                        for row in rows[1:]]

  pprint(results)

This should get you much closer to solving the problem.
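The question also asks what to do when rows have a varying number of cells. As a rough sketch of one option, the snippet below loops over the ResultSet as described in the answer and pads short rows with None so every row dict has the same keys. The inline HTML, the film and director names, and the helper name parse_tables are made up for illustration and are not taken from the Wikipedia page; the explicit "html.parser" argument is also a choice made here, not something the original answer specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables sharing one class,
# where one data row has fewer cells than the header row.
HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Genre</th><th>Director</th></tr>
  <tr><td>Film A</td><td>Drama</td><td>Director One</td></tr>
  <tr><td>Film B</td><td>Comedy</td></tr>
</table>
<table class="wikitable">
  <tr><th>Title</th><th>Genre</th><th>Director</th></tr>
  <tr><td>Film C</td><td>Action</td><td>Director Two</td></tr>
</table>
"""


def parse_tables(html, css_class="wikitable"):
    """Return one list of row dicts per table that has the given class."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    # find_all() returns a ResultSet, so iterate over it rather than
    # calling find_all() on the ResultSet itself.
    for table in soup.find_all("table", class_=css_class):
        rows = table.find_all("tr")
        if not rows:
            continue
        # Column names come from the header cells of the first row.
        headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
        parsed = []
        for row in rows[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if not cells:
                continue  # skip rows that contain no data cells
            # Pad short rows with None; extra cells are dropped by zip().
            cells += [None] * (len(headers) - len(cells))
            parsed.append(dict(zip(headers, cells)))
        tables.append(parsed)
    return tables


if __name__ == "__main__":
    for index, movies in enumerate(parse_tables(HTML)):
        print(index, movies)
```

Padding on the right only makes sense when the missing cells are at the end of the row. If a row is short because an earlier row uses a rowspan cell, the missing value would have to be carried down from that earlier row instead, which this sketch does not attempt.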


                    