Python BeautifulSoup: iterate over a table
Problem description
I am trying to scrape table data into a CSV file. Unfortunately, I've hit a roadblock: the following code simply repeats the TDs from the first TR for all subsequent TRs.
import urllib.request
from bs4 import BeautifulSoup

f = open('out.txt','w')
url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page)
soup.unicode
table1 = soup.find("table", border=1)
table2 = soup.find('tbody')
table3 = soup.find_all('tr')
for td in table3:
    rn = soup.find_all("td")[0].get_text()
    sr = soup.find_all("td")[1].get_text()
    d = soup.find_all("td")[2].get_text()
    n = soup.find_all("td")[3].get_text()
    print(rn + "," + sr + "," + d + ",", file=f)
This is my first ever Python script so any help would be appreciated! I have looked over other question answers but cannot figure out what I am doing wrong here.
Recommended answer
You're starting at the top level of your document each time you use find() or find_all(), so when you ask for, say, all the "td" tags, you're getting all the "td" tags in the document, not just those in the table and row you have searched for. You might as well not search for the table and rows at all, because the way your code is written, those results are never used.
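To make the difference concrete, here is a minimal sketch that runs without a network connection. The HTML string is a hypothetical two-row table standing in for the real page, and the names from_document and from_row are made up for illustration. It compares searching from the soup object with searching from each row.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page (an assumption).
html = """
<table border="1">
  <tbody>
    <tr><td>A-1</td><td>first</td></tr>
    <tr><td>A-2</td><td>second</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching from the soup object ignores the loop variable,
# so every iteration returns the first cell in the whole document.
from_document = [soup.find_all("td")[0].get_text() for row in soup.find_all("tr")]

# Searching from the row itself only sees the cells inside that row.
from_row = [row.find_all("td")[0].get_text() for row in soup.find_all("tr")]

print(from_document)
print(from_row)
```

The first list repeats the same value for every row, which is the symptom described in the question; the second list changes from row to row.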
I think you want to do something like this:
table1 = soup.find("table", border=1)
table2 = table1.find('tbody')
table3 = table2.find_all('tr')
Or, you know, something more like this, with more descriptive variable names to boot:
rows = soup.find("table", border=1).find("tbody").find_all("tr")
for row in rows:
    cells = row.find_all("td")
    rn = cells[0].get_text()
    # and so on
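Filling in the "and so on" part, a complete version might look roughly like the sketch below. It is only an illustration under some assumptions: the sample HTML and its values are invented, the helper name extract_rows is hypothetical, the table is assumed to contain an explicit tbody element (html.parser does not add one), and the output goes to an in-memory buffer rather than out.txt so the example runs anywhere. The csv module is swapped in for the manual string concatenation because it quotes cells that contain commas.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's columns and values may differ.
html = """
<table border="1">
  <tbody>
    <tr><th>Request number</th><th>Summary</th><th>Disposition</th><th>Pages</th></tr>
    <tr><td>A-2011-001</td><td>Records on travel, 2011</td><td>Disclosed in part</td><td>12</td></tr>
    <tr><td>A-2011-002</td><td>Briefing notes</td><td>All disclosed</td><td>40</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Return the first four cells of every data row as a list of strings."""
    soup = BeautifulSoup(markup, "html.parser")
    # Each search starts from the previous result, not from the document root.
    rows = soup.find("table", border="1").find("tbody").find_all("tr")
    records = []
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < 4:
            continue  # skip header rows (th only) and incomplete rows
        records.append([cell.get_text(strip=True) for cell in cells[:4]])
    return records


records = extract_rows(html)

# Write to an in-memory buffer; csv.writer quotes cells containing commas.
buffer = io.StringIO()
csv.writer(buffer).writerows(records)
print(buffer.getvalue())
```

To write a real file instead, the buffer could be replaced with open("out.csv", "w", newline="") inside a with block, which is the form the csv module documentation recommends.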