Scrape a series of tables with BeautifulSoup


Problem Description

I am trying to learn about web scraping and Python (and programming for that matter) and have found the BeautifulSoup library, which seems to offer a lot of possibilities.

I am trying to find out how to best pull the pertinent information from this page:

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

I can go into more detail on this, but basically the company name, the description about it, contact details, the various company details / statistics, etc.

At this stage I'm looking at how to cleanly isolate this data and scrape it, with a view to putting it all in a CSV or something later.

I am confused about how to use BS to grab the different table data. There are lots of tr and td tags, and I'm not sure how to anchor on to anything unique.

The best I have come up with is the following code as a start:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html)
soupie = soup.prettify()
print soupie

and then from there use regex, etc. to pull data from the cleaned-up text.

But there must be a better way to do this using the BS tree? Or is this site formatted in a way that BS won't provide much more help?
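For what it's worth, one way to anchor on something unique without regex is to filter on the `class` attribute of a cell. A minimal sketch (the inline HTML below is illustrative only, not copied from the live page):

```python
from bs4 import BeautifulSoup

# Illustrative markup standing in for one section of the page
html = """
<table>
  <tr><td class="Feature-Heading">Personnel</td></tr>
  <tr><td class="bodytext">Ms Gail Morgan CEO</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Anchor on the class attribute instead of counting tr/td tags
heading = soup.find("td", class_="Feature-Heading")
print(heading.text)  # Personnel
print(soup.find("td", class_="bodytext").text)  # Ms Gail Morgan CEO
```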

Not looking for a full solution as that is a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below I am starting to understand how to parse through the tables. Here is the output from his code:

=== Personnel ===
bodytext    Ms Gail Morgan CEO
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
bodytext    Lisa Mayoh Sales Manager
bodytext    Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: bob@aerospacematerials.com.au

=== Company Details ===
bodytext    ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: lisa@aerospacematerials.com.au Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422
paraheading ACN:
bodytext    007 350 807
paraheading ABN:
bodytext    71 007 350 807
paraheading 
bodytext    Australian Owned
paraheading Annual Turnover:
bodytext    $5M - $10M
paraheading Number of Employees:
bodytext    6-10
paraheading QA:
bodytext    ISO9001-2008, AS9120B,
paraheading Export Percentage:
bodytext    5 %
paraheading Industry Categories:
bodytext    AerospaceLand (Vehicles, etc)LogisticsMarineProcurement
paraheading Company Email:
bodytext    lisa@aerospacematerials.com.au
paraheading Company Website:
bodytext    http://www.aerospacematerials.com.au
paraheading Office:
bodytext    2/6 Ovata Drive Tullamarine VIC 3043
paraheading Post:
bodytext    PO Box 188 TullamarineVIC 3043
paraheading Phone:
bodytext    +61.3. 9464 4455
paraheading Fax:
bodytext    +61.3. 9464 4422

My next question is, what is the best way to put this data into a CSV which would be suitable for importing into a spreadsheet? For example, having things like 'ABN', 'ACN', 'Company Website', etc. as column headings, and then the corresponding data as row information.

Thanks for any help.

Solution

Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea how you can get the relevant information out of the page:

import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        print("\t".join([item["class"][0], " ".join(item.text.split())]))

I find requests a more pleasant library to work with than urllib2, but of course that's up to you.

In response to your followup question, here's something you could use to write a CSV file from the scraped data:

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ]  # ... etc.

# newline="" prevents blank lines between rows on Windows
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        row = {}
        for heading in soup.find_all("td", {"class": "paraheading"}):
            # Collapse whitespace and strip the trailing colon from the label
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                value = " ".join(next_td.text.split())
                row[key] = value
        writer.writerow(row)
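To sanity-check the resulting file, you can read it back with csv.DictReader. This sketch skips the network step entirely and writes one sample row using values taken from the output shown above:

```python
import csv

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
# Sample row standing in for one scraped page (values from the question's output)
row = {"ACN": "007 350 807", "ABN": "71 007 350 807",
       "Annual Turnover": "$5M - $10M", "QA": "ISO9001-2008, AS9120B,"}

with open("data.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    writer.writerow(row)

# Read it back: DictReader maps each row to the column headings
with open("data.csv", newline="") as csv_file:
    rows = list(csv.DictReader(csv_file))

print(rows[0]["ABN"])  # 71 007 350 807
```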

