Python+BeautifulSoup: scraping a particular table from a webpage

Question

I'm trying to scrape a particular table from this webpage.

What I want to scrape is the stock information: the dates, the company name, the ratio, and whether or not the stock is optionable.

Here's what I have so far:

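from urllib.request import urlopen  # Python 3; the original post used urllib2 (Python 2)

from bs4 import BeautifulSoup

url = "http://biz.yahoo.com/c/s.html"
page = urlopen(url)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.read(), "html.parser")

# Every <table> element on the page
alltables = soup.find_all('table')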

This code gives me all the tables on the page (there is more than one).

1) I'm not sure how to identify the table that I need.

2) I'm not sure how to extract the info from that table into an array, a list, or some other data structure I can use for further analysis.

Recommended answer

The markup is not exactly easy to scrape: there are no ids or specific class attributes that you can use to distinguish the tables from one another. What I would do in this case is find the header cell labeled Payable and then take its nearest enclosing table:

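# Bold header cell whose text is exactly "Payable" (string= is the current name for text=)
header = soup.find("b", string="Payable")
# Nearest <table> ancestor of that cell
table = header.find_parent("table")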
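
To try the header-anchor approach in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The markup, the SAMPLE_HTML name, and the find_table_by_header helper are made up for illustration; they only imitate a page with several unlabeled tables, and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables with no id or class to tell them apart.
SAMPLE_HTML = """
<table><tr><td><b>Navigation</b></td></tr></table>
<table>
  <tr><td><b>Payable</b></td><td><b>Company</b></td></tr>
  <tr><td>Jan 02</td><td>Acme Corp</td></tr>
</table>
"""


def find_table_by_header(soup, label):
    """Return the nearest <table> enclosing a <b> whose text is `label`, or None."""
    header = soup.find("b", string=label)
    if header is None:
        return None  # label not on the page; the caller decides what to do
    return header.find_parent("table")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_by_header(soup, "Payable")
print(table.find("td").get_text(strip=True))
```

Returning None when the label is missing avoids an AttributeError on header.find_parent if the page layout changes.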

Then you can iterate over the table rows, skipping the first two (the header row and the divider row):

for row in table.find_all("tr")[2:]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
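
The [2:] slice assumes the table always starts with exactly two non-data rows. One possible alternative, which is not part of the original answer, is to filter rows by their content instead. The sketch below assumes that header cells are bold and that data rows have a fixed number of cells; the sample markup and EXPECTED_COLUMNS are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table: bold header row, divider row, then two data rows.
SAMPLE_HTML = """
<table>
  <tr><td><b>Payable</b></td><td><b>Company</b></td><td><b>Ratio</b></td></tr>
  <tr><td colspan="3"><hr></td></tr>
  <tr><td>Jan 02</td><td>Acme Corp</td><td>2-1</td></tr>
  <tr><td>Jan 05</td><td>Globex Inc</td><td>3-2</td></tr>
</table>
"""
EXPECTED_COLUMNS = 3  # assumed column count for this sample

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

data_rows = []
for row in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    is_header = row.find("b") is not None  # header cells are bold in this sample
    # The divider row has a single cell, so the length check drops it.
    if not is_header and len(cells) == EXPECTED_COLUMNS:
        data_rows.append(cells)

print(data_rows)
```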

You can also collect the rows into a list of lists:

[[cell.get_text(strip=True) 
  for cell in row.find_all("td")]
 for row in table.find_all("tr")[2:]]
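
For further analysis it can be handier to address values by name rather than by position. A rough sketch of one way to do that follows: it pairs each row with a list of column names to build dictionaries. The sample markup and the names in COLUMNS are assumptions chosen for the example, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and column names; the real page may use different ones.
SAMPLE_HTML = """
<table>
  <tr><td><b>Payable</b></td><td><b>Company</b></td><td><b>Ratio</b></td></tr>
  <tr><td colspan="3"><hr></td></tr>
  <tr><td>Jan 02</td><td>Acme Corp</td><td>2-1</td></tr>
  <tr><td>Jan 05</td><td>Globex Inc</td><td>3-2</td></tr>
</table>
"""
COLUMNS = ["payable", "company", "ratio"]  # assumed names for this sample

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

# Same list-of-lists construction as above, skipping header and divider rows.
rows = [[cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.find_all("tr")[2:]]

# Pair each value with its column name; zip() stops at the shorter sequence.
records = [dict(zip(COLUMNS, row)) for row in rows]

print(records[0]["company"])
```

A list of dictionaries like this can be passed to most tabular tools or written out as CSV or JSON without remembering the column order.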
