Python BeautifulSoup: iterate over table

Question

I am trying to scrape table data into a CSV file. Unfortunately, I've hit a roadblock: the following code simply repeats the TDs from the first TR for all subsequent TRs.

import urllib.request
from bs4 import BeautifulSoup

f = open('out.txt','w')

url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx"
page = urllib.request.urlopen(url)

soup = BeautifulSoup(page)

soup.unicode

table1 = soup.find("table", border=1)
table2 = soup.find('tbody')
table3 = soup.find_all('tr')

for td in table3:
    rn = soup.find_all("td")[0].get_text()
    sr = soup.find_all("td")[1].get_text()
    d = soup.find_all("td")[2].get_text()
    n = soup.find_all("td")[3].get_text()

    print(rn + "," + sr + "," + d + ",", file=f)

This is my first ever Python script so any help would be appreciated! I have looked over other question answers but cannot figure out what I am doing wrong here.

Recommended answer

You're starting at the top level of your document each time you use find() or find_all(), so when you ask for, for example, all the "td" tags, you get every "td" tag in the document, not just those in the table and row you have searched for. You might as well not search for the table and rows at all, because the way your code is written, the results of those searches are never used.
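To make the scoping difference concrete, here is a minimal sketch on a made-up two-row table. The HTML string and the names doc_wide and row_scoped are assumptions for illustration, not the page from the question. Calling find_all() on soup searches the whole document, while calling it on a row searches only inside that row.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page.
html = """
<table border="1"><tbody>
<tr><td>A-1</td><td>first</td></tr>
<tr><td>A-2</td><td>second</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

doc_wide = []    # what the loop in the question effectively does
row_scoped = []  # what it was meant to do

for row in soup.find_all("tr"):
    # Searching from soup ignores the current row entirely.
    doc_wide.append(soup.find_all("td")[0].get_text())
    # Searching from row only looks at that row's own cells.
    row_scoped.append(row.find_all("td")[0].get_text())

print(doc_wide)    # ['A-1', 'A-1']
print(row_scoped)  # ['A-1', 'A-2']
```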

I think you want to do something like this:

table1 = soup.find("table", border=1)
table2 = table1.find('tbody')
table3 = table2.find_all('tr')

Or, you know, something more like this, with more descriptive variable names to boot:

rows = soup.find("table", border=1).find("tbody").find_all("tr")

for row in rows:
    cells = row.find_all("td")
    rn = cells[0].get_text()
    # and so on
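
Filling in the "and so on", a complete version might look roughly like the sketch below. It is not the answerer's code: the sample HTML, the column meanings, and the helper name table_rows are all made up for illustration, and it swaps the print-with-commas approach for the standard csv module so that cells containing commas are quoted properly. It also searches for "tr" directly under the table, on the assumption that a "tbody" tag may not be present in every parse tree.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page; the columns are invented.
html = """
<table border="1">
<tr><th>Request number</th><th>Summary</th><th>Disposition</th><th>Pages</th></tr>
<tr><td>A-1</td><td>Briefing notes, 2011</td><td>All disclosed</td><td>12</td></tr>
<tr><td>A-2</td><td>Travel claims</td><td>Disclosed in part</td><td>40</td></tr>
</table>
"""


def table_rows(soup):
    """Yield the text of each <td> cell, one list per data row."""
    table = soup.find("table")
    for row in table.find_all("tr"):
        cells = row.find_all("td")  # scoped to this row only
        if not cells:  # header rows hold <th> cells, not <td>
            continue
        yield [cell.get_text(strip=True) for cell in cells]


soup = BeautifulSoup(html, "html.parser")
rows = list(table_rows(soup))

# Written to an in-memory buffer here; a real script would use
# open("out.csv", "w", newline="") instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because the second cell of the first row contains a comma, csv.writer wraps it in quotes, which plain string concatenation would not do.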
