Scrape a series of tables with BeautifulSoup
Problem Description
I am trying to learn about web scraping and Python (and programming, for that matter) and have found the BeautifulSoup library, which seems to offer a lot of possibilities.
I am trying to find out how best to pull the pertinent information from this page:
http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113
I can go into more detail on this, but basically I want the company name, the description of it, contact details, the various company details/statistics, etc.
At this stage I am looking at how to cleanly isolate this data and scrape it, with a view to putting it all in a CSV or something similar later.
I am confused about how to use BS to grab the different table data. There are lots of tr and td tags, and I am not sure how to anchor on to anything unique.
The best I have come up with is the following code as a start:
from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html, "html.parser")
soupie = soup.prettify()
print soupie
and then from there use regex etc. to pull data from the cleaned-up text.
But there must be a better way to do this using the BS tree? Or is this site formatted in a way that BS won't provide much more help?
Not looking for a full solution, as that is a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.
Update
Thanks to @ZeroPiraeus below, I am starting to understand how to parse through the tables. Here is the output from his code:
=== Personnel ===
bodytext Ms Gail Morgan CEO
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext Lisa Mayoh Sales Manager
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au
=== Company Details ===
bodytext ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext 007 350 807
paraheading ABN:
bodytext 71 007 350 807
paraheading
bodytext Australian Owned
paraheading Annual Turnover:
bodytext $5M - $10M
paraheading Number of Employees:
bodytext 6-10
paraheading QA:
bodytext ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext 5 %
paraheading Industry Categories:
bodytext AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext http://www.aerospacematerials.com.au
paraheading Office:
bodytext 2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext +61.3. 9464 4455
paraheading Fax:
bodytext +61.3. 9464 4422
My next question is: what is the best way to put this data into a CSV suitable for importing into a spreadsheet? For example, having things like 'ABN', 'ACN', 'Company Website', etc. as column headings, and the corresponding data as row information.
Thanks for any help.
Answer
Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea how you can get the relevant information out of the page:
import requests
from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        print("\t".join([item["class"][0], " ".join(item.text.split())]))
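The key idea in the snippet above is anchoring on class attributes rather than on raw tr/td positions. A minimal, self-contained sketch of how that works (the HTML fragment below is made up to mimic the page's structure, not copied from it); note that passing a list of classes to find_all matches a td carrying any of them:

```python
from bs4 import BeautifulSoup

# Hand-written fragment mimicking the page's paraheading/bodytext structure
fragment = """
<table>
  <tr><td class="paraheading">ABN:</td></tr>
  <tr><td class="bodytext">71 007 350 807</td></tr>
</table>
"""

soup = BeautifulSoup(fragment, "html.parser")

# A list of classes matches tds carrying either class, in document order
for td in soup.find_all("td", {"class": ["paraheading", "bodytext"]}):
    print(td["class"][0], td.text.strip())
```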
I find requests a more pleasant library to work with than urllib2, but of course that's up to you.
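For comparison, fetching a page with requests can be wrapped in a small helper like the one below (the function name, timeout value, and error handling are illustrative additions, not part of the original answer):

```python
import requests

def fetch_html(url):
    """Download a page and return its decoded HTML text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text

# The rough urllib2 equivalent would be urllib2.urlopen(url).read(), which
# neither checks the status code nor decodes the body for you.
```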
In response to your followup question, here's something you could use to write a CSV file from the scraped data:
import csv
import requests
from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ]  # ... etc.

with open("data.csv", "w") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                value = " ".join(next_td.text.split())
                row[key] = value
        writer.writerow(row)
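One detail worth knowing about csv.DictWriter here: if a company page lacks one of the chosen fields, the row dict simply has no entry for that key, and DictWriter writes an empty cell for it (the restval default). A small sketch with made-up data, where the second row has no QA value:

```python
import csv
import io

columns = ["ACN", "ABN", "Annual Turnover", "QA"]

# Simulated scraped rows; "QA" is absent from the second row, as it might be
# for a company page that does not list that field
rows = [
    {"ACN": "007 350 807", "ABN": "71 007 350 807",
     "Annual Turnover": "$5M - $10M", "QA": "ISO9001-2008, AS9120B,"},
    {"ACN": "123 456 789", "ABN": "12 123 456 789",
     "Annual Turnover": "$1M - $5M"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, columns)
writer.writeheader()
for row in rows:
    writer.writerow(row)  # missing keys become empty cells

print(buffer.getvalue())
```

Fields containing commas (like the QA value) are quoted automatically, so the spreadsheet import stays aligned.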