How to get the web page using requests.post?


Problem description

I want to get the result of the web page http://www3.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx with the input of stock code being 5.

The problem is that I don't know what the website does after pressing Search, because it runs JavaScript.

Furthermore, how do I find the parameters that need to be passed to requests.post, e.g. data? Is a header needed?

Solution

You have multiple options:

1) You can use Selenium. First install Selenium.

sudo pip3 install selenium

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads (depending on your OS, you may need to specify the location of the driver):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
url = "http://www3.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx"
browser.get(url)
element = browser.find_element_by_id('ctl00_txt_stock_code')  # find the text box
time.sleep(2)
element.send_keys('5')  # populate the text box
time.sleep(2)
element.submit()  # submit the form
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()
for news in soup.find_all(class_='news'):
    print(news.text)

2) Or use PyQt with QWebEngineView.

Install PyQt on Ubuntu:

    sudo apt-get install python3-pyqt5
    sudo apt-get install python3-pyqt5.qtwebengine

or on other OSes (64-bit versions of Python):

    pip3 install PyQt5

Basically you load the first page with the form on it, fill in the form by running JavaScript, then submit it. The loadFinished() signal is emitted twice, the second time because you submitted the form, so you can use an if statement to differentiate between the two calls.
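The two-pass idea can be seen in isolation: a flag flips after the first load, so the second load is routed to the scraping step instead. A plain-Python sketch of that dispatch (no Qt involved; the handler names are illustrative):

```python
class TwoPassDispatcher:
    """Route the first loadFinished-style callback to a fill/submit
    handler and every later one to a scrape handler."""
    def __init__(self, on_first, on_second):
        self.on_first = on_first
        self.on_second = on_second
        self.first_pass = True

    def load_finished(self, ok):
        if self.first_pass:
            self.first_pass = False
            self.on_first()
        else:
            self.on_second(ok)

calls = []
dispatcher = TwoPassDispatcher(
    lambda: calls.append('fill form and submit'),
    lambda ok: calls.append('scrape results'),
)
dispatcher.load_finished(True)  # initial page load
dispatcher.load_finished(True)  # reload caused by the form submit
print(calls)  # ['fill form and submit', 'scrape results']
```

The Render class below does exactly this, with self.first_pass guarding _first_finished() against the second load.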

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
from bs4 import BeautifulSoup


class Render(QWebEngineView):
    def __init__(self, url):
        self.html = None
        self.first_pass = True
        self.app = QApplication(sys.argv)
        QWebEngineView.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        if self.first_pass:
            self._first_finished()
            self.first_pass = False
        else:
            self._second_finished()

    def _first_finished(self):
        self.page().runJavaScript("document.getElementById('ctl00_txt_stock_code').value = '5';")
        self.page().runJavaScript("document.getElementById('ctl00_sel_DateOfReleaseFrom_y').value='1999';")
        self.page().runJavaScript("preprocessMainForm();")
        self.page().runJavaScript("document.forms[0].submit();")

    def _second_finished(self):
        self.page().toHtml(self.callable)

    def callable(self, data):
        self.html = data
        self.app.quit()

url = "http://www3.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx"
web = Render(url)
soup = BeautifulSoup(web.html, 'html.parser')
for news in soup.find_all(class_='news'):
    print(news.text)

Outputs:

Voting Rights and Capital
Next Day Disclosure Return
NOTICE OF REDEMPTION AND CANCELLATION OF LISTING
THIRD INTERIM DIVIDEND FOR 2018
Notification of Transactions by Persons Discharging Managerial Responsibilities
Next Day Disclosure Return
THIRD INTERIM DIVIDEND FOR 2018
Monthly Return of Equity Issuer on Movements in Securities for the month ended 31 October 2018
Voting Rights and Capital
PUBLICATION OF BASE PROSPECTUS SUPPLEMENT
3Q 2018 EARNINGS RELEASE AUDIO WEBCAST AND CONFERENCE CALL
3Q EARNINGS RELEASE - HIGHLIGHTS
Scrip Dividend Circular
2018 Third Interim Dividend; Scrip Dividend
THIRD INTERIM DIVIDEND FOR 2018 SCRIP DIVIDEND ALTERNATIVE
NOTIFICATION OF MAJOR HOLDINGS
EARNINGS RELEASE FOR THIRD QUARTER 2018
NOTIFICATION OF MAJOR HOLDINGS
Monthly Return of Equity Issuer on Movements in Securities for the month ended 30 September 2018
THIRD INTERIM DIVIDEND FOR 2018; DIVIDEND ON PREFERENCE SHARES

Alternatively you can use Scrapy-Splash https://github.com/scrapy-plugins/scrapy-splash

Or Requests-HTML https://html.python-requests.org/ .

But I am not sure how you would fill the form in using those last two approaches.
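As for the original requests.post question: ASP.NET WebForms pages like this one usually reject a bare POST unless the hidden state fields (__VIEWSTATE, __EVENTVALIDATION, and friends) from the initial GET response are echoed back. A minimal, dependency-free sketch of collecting those fields into a payload; the sample form and the ctl00$txt_stock_code field name are illustrative assumptions, so inspect the real form in your browser's DevTools:

```python
from html.parser import HTMLParser

class HiddenFieldCollector(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> fields."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'input' and a.get('type') == 'hidden' and a.get('name'):
            self.fields[a['name']] = a.get('value', '')

def build_payload(form_html, stock_code):
    collector = HiddenFieldCollector()
    collector.feed(form_html)
    payload = dict(collector.fields)
    payload['ctl00$txt_stock_code'] = stock_code  # assumed field name
    return payload

# Stand-in for the HTML returned by the initial GET:
sample = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="def456" />
  <input type="text" id="ctl00_txt_stock_code" name="ctl00$txt_stock_code" />
</form>
"""
payload = build_payload(sample, '5')
# payload now holds the hidden tokens plus the stock code,
# ready for requests.post(url, data=payload)
```

Even with the tokens in place, a page that massages the form in JavaScript before submitting (the preprocessMainForm() call above) may add fields this sketch misses, which is why the browser-driven approaches are more reliable here.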

Update: how to read the next pages:

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEngineView
from bs4 import BeautifulSoup


class Render(QWebEngineView):
    def __init__(self, url):
        self.html = None
        self.count = 0
        self.first_pass = True
        self.app = QApplication(sys.argv)
        QWebEngineView.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, result):
        if self.first_pass:
            self._first_finished()
            self.first_pass = False
        else:
            self._second_finished()

    def _first_finished(self):
        self.page().runJavaScript("document.getElementById('ctl00_txt_stock_code').value = '5';")
        self.page().runJavaScript("document.getElementById('ctl00_sel_DateOfReleaseFrom_y').value='1999';")
        self.page().runJavaScript("preprocessMainForm();")
        self.page().runJavaScript("document.forms[0].submit();")

    def _second_finished(self):
        try:
            self.page().toHtml(self.parse)
            self.count += 1
            if self.count > 5:
                self.page().toHtml(self.callable)
            else:
                self.page().runJavaScript("document.getElementById('ctl00_btnNext2').click();")
        except:
            self.page().toHtml(self.callable)

    def parse(self, data):
        soup = BeautifulSoup(data, 'html.parser')
        for news in soup.find_all(class_='news'):
            print(news.text)

    def callable(self, data):
        self.app.quit()

url = "http://www3.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main.aspx"
web = Render(url)
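Stripped of the browser plumbing, the paging logic in _second_finished is just a counter: parse the current page, then either click "next" or quit once the count passes 5. A stand-alone sketch of that control flow (the page contents here are made-up stand-ins):

```python
def crawl(pages, max_pages=5):
    """Walk a sequence of result pages the way _second_finished does:
    parse each page and advance until the counter passes the limit
    or the pages run out."""
    collected = []
    count = 0
    for page in pages:
        collected.extend(page)  # stands in for self.parse()
        count += 1
        if count > max_pages:   # stands in for toHtml(self.callable)
            break
    return collected

# Eight fake result pages with two headlines each:
pages = [[f'news {i}.{j}' for j in range(2)] for i in range(8)]
print(len(crawl(pages)))  # 12 -- only the first six pages are read
```

Note that the hard-coded limit of five extra clicks is arbitrary; a real crawler would instead stop when the "next" button disappears from the page.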
