How to navigate through HTML pages that have paging for their content using Python?


Problem description


I want to crawl all the table entries (the table that lists S/No., Document No., etc.) from the following website and write them to Excel. So far I am only able to crawl the data from the first page (10 entries). Can anyone help me with a Python code snippet to crawl the data from the first page to the last page of this website?

Website: https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId


My python code:

# Python 2 code; only the imports actually used below are kept.
from bs4 import BeautifulSoup
import mechanize
import pprint
import re
import csv

browser = mechanize.Browser()
browser.set_handle_robots(False)
url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'
response = browser.open(url)
html_doc = response.read()

rows_list = []
table_dict = {}

soup = BeautifulSoup(html_doc, "html.parser")

table = soup.find("table", attrs={"width": "100%", "border": "0", "cellspacing": "2", "cellpadding": "3", "bgcolor": "#FFFFFF"})
tr_elements = table.find_all("tr", class_=re.compile(r'(row_even|row_odd|header_subone)'))

for i in range(0, len(tr_elements)):
    tr_element = tr_elements[i]
    rows_list.append([])

    for td_element in tr_element.find_all("td"):
        # Skip cells that only wrap nested layout tables.
        if len(td_element.find_all("table")) > 0:
            continue
        rows_list[i].append(td_element.text)

pprint.pprint(rows_list)

rows_list = [row for row in rows_list if row]  # drop all empty rows, not just the first

for row in rows_list:
    table_dict[row[0]] = {
        #'S/No.' : row[1],
        'Document No.': row[1] + row[2],
        'Tenders and Quotations': row[3] + row[4],
        'Publication Date': row[5],
        'Closing Date': row[6],
        'Status': row[7]
    }

pprint.pprint(table_dict)

with open('gebiz.csv', 'wb') as csvfile:
    csvwriter = csv.writer(csvfile, dialect='excel')

    for key in sorted(table_dict.iterkeys()):
         csvwriter.writerow([table_dict[key]['Document No.'], table_dict[key]['Tenders and Quotations'], table_dict[key]['Publication Date'], table_dict[key]['Closing Date'], table_dict[key]['Status']])


Every help from your side will be highly appreciated.

Recommended answer

As I can see on this page, you need to interact with JavaScript that is invoked by the Go button or the Next Page button. For the Go button you need to fill in the textbox each time. You can use different approaches to work around this:

1) Selenium - Web Browser Automation

2) spynner - a programmatic web browsing module with AJAX support for Python (here)

3) If you are familiar with C#, it also provides a WebBrowser component that helps you click on HTML elements (e.g. here). You can save the HTML content of each page and later crawl them from the offline pages.
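The Selenium route from option 1 could be sketched roughly as below. This is a minimal, untested sketch: the `row_even`/`row_odd` row classes are taken from the question's own code, while the `"Next Page"` link text is an assumption that must be checked against the live page (the real paginator may be an image button that needs a different locator, e.g. by name or XPath).

```python
import re
from bs4 import BeautifulSoup

ROW_CLASSES = re.compile(r'(row_even|row_odd)')

def extract_rows(html_doc):
    """Return the visible cell text for each data row on one page."""
    soup = BeautifulSoup(html_doc, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_=ROW_CLASSES):
        cells = [td.get_text(strip=True)
                 for td in tr.find_all("td", recursive=False)
                 if not td.find("table")]   # skip cells wrapping nested tables
        if cells:
            rows.append(cells)
    return rows

def crawl_all_pages(url):
    # Imported here so extract_rows() stays usable without a browser.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get(url)
    all_rows = []
    while True:
        all_rows.extend(extract_rows(driver.page_source))
        try:
            # "Next Page" is an assumed link text -- inspect the real button.
            driver.find_element_by_link_text("Next Page").click()
        except NoSuchElementException:
            break   # no more pages
    driver.quit()
    return all_rows
```

The parsing is kept separate from the browser driving, so `extract_rows()` can be developed and tested against saved HTML before the Selenium loop is pointed at the live site.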
