How to navigate through HTML pages that have paging for their content using Python?


Problem description


I want to crawl all the table entries (the table that lists S/No., Document No., etc.) from the following website and write them to Excel. So far I am only able to crawl the data from the first page (10 entries). Can anyone help me with a Python code snippet to crawl the data from the first page to the last page of this website?

Website: https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId


My python code:

# Python 2 code; only the imports actually used below are kept.
from bs4 import BeautifulSoup
import mechanize
import pprint
import re
import csv

browser = mechanize.Browser()
browser.set_handle_robots(False)
url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'
response = browser.open(url)
html_doc = response.read()

rows_list = []
table_dict = {}

soup = BeautifulSoup(html_doc, "html.parser")

table = soup.find("table", attrs={"width": "100%", "border": "0", "cellspacing": "2", "cellpadding": "3", "bgcolor": "#FFFFFF"})
tr_elements = table.find_all("tr", class_=re.compile(r'(row_even|row_odd|header_subone)'))

for i in range(0, len(tr_elements)):
    tr_element = tr_elements[i]
    rows_list.append([])

    for td_element in tr_element.find_all("td"):
        # Skip cells that only wrap nested layout tables.
        if len(td_element.find_all("table")) > 0:
            continue
        rows_list[i].append(td_element.text)

pprint.pprint(rows_list)

rows_list = [row for row in rows_list if row]  # drop all empty rows, not just the first

for row in rows_list:
    table_dict[row[0]] = {
        #'S/No.' : row[1],
        'Document No.': row[1] + row[2],
        'Tenders and Quotations': row[3] + row[4],
        'Publication Date': row[5],
        'Closing Date': row[6],
        'Status': row[7]
    }

pprint.pprint(table_dict)

with open('gebiz.csv', 'wb') as csvfile:
    csvwriter = csv.writer(csvfile, dialect='excel')

    for key in sorted(table_dict.iterkeys()):
         csvwriter.writerow([table_dict[key]['Document No.'], table_dict[key]['Tenders and Quotations'], table_dict[key]['Publication Date'], table_dict[key]['Closing Date'], table_dict[key]['Status']])


Every help from your side will be highly appreciated.

Recommended answer

As I can see on this page, you need to interact with JavaScript that is invoked by the Go button or the Next Page button. For the Go button you need to fill in the textbox each time. You can use different approaches to work around this:

1) Selenium - Web Browser Automation

2) spynner - a programmatic web browsing module with AJAX support for Python (here)

3) If you are familiar with C#, it also provides a WebBrowser component that helps you click on HTML elements (e.g. here). You can save the HTML content of each page and later crawl them from the offline pages.
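The Selenium route from option 1 could be sketched roughly as below. This is a minimal, untested sketch: the `row_even`/`row_odd` row classes are taken from the question's own code, while the `"Next Page"` link text is an assumption that must be checked against the live page (the real paginator may be an image button that needs a different locator, e.g. by name or XPath).

```python
import re
from bs4 import BeautifulSoup

ROW_CLASSES = re.compile(r'(row_even|row_odd)')

def extract_rows(html_doc):
    """Return the visible cell text for each data row on one page."""
    soup = BeautifulSoup(html_doc, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_=ROW_CLASSES):
        cells = [td.get_text(strip=True)
                 for td in tr.find_all("td", recursive=False)
                 if not td.find("table")]   # skip cells wrapping nested tables
        if cells:
            rows.append(cells)
    return rows

def crawl_all_pages(url):
    # Imported here so extract_rows() stays usable without a browser.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get(url)
    all_rows = []
    while True:
        all_rows.extend(extract_rows(driver.page_source))
        try:
            # "Next Page" is an assumed link text -- inspect the real button.
            driver.find_element_by_link_text("Next Page").click()
        except NoSuchElementException:
            break   # no more pages
    driver.quit()
    return all_rows
```

The parsing is kept separate from the browser driving, so `extract_rows()` can be developed and tested against saved HTML before the Selenium loop is pointed at the live site.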
