Scraping .aspx page with Python (HKEX)
Question
I am trying to scrape the following website: http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx
I'm using Python 2.7. Here is my code:
import urllib
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,zh-TW;q=0.7,zh;q=0.6,zh-CN;q=0.5'}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'

myopener = MyOpener()
url = 'http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx'
f = myopener.open(url)
soup_dummy = BeautifulSoup(f, "html5lib")
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
soup_dummy.find(id="aspnetForm")
formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00$txt_stock_code', '00005')
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f, "html5lib")
date = soup.find("span", id="lbDateTime")
print(date)
Nothing is collected: the code prints "None". If I change print(date) to print(date.text), I get an error: AttributeError: 'NoneType' object has no attribute 'text'
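The error itself is standard BeautifulSoup behaviour: when find() matches nothing it returns None, and None has no .text attribute. A minimal offline sketch, using a made-up HTML snippet instead of the live page:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the only span's id merely ends with "lbDateTime"
html = '<span id="ctl00_lbDateTime">01/03/2018</span>'
soup = BeautifulSoup(html, 'html.parser')

date = soup.find("span", id="lbDateTime")  # exact id match -> no hit
print(date)  # None

try:
    date.text
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'text'
```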
Answer
Your question is a little vague, but here's my attempt:
Running your code gives me the following response: "The page requested may have been relocated, renamed or removed from the Hong Kong Exchanges and Clearing Limited, or HKEX, website."
Additionally, I don't see any span ids equal to lbDateTime. I do, however, see span ids that end with lbDateTime. If you are not receiving such an error, you might try this instead:

dates = soup.findAll("span", {"id": lambda L: L and L.endswith('lbDateTime')})
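The difference between the exact-id lookup and the suffix filter can be checked offline; the id ctl00_gvMain_ctl02_lbDateTime below is a fabricated example of the auto-generated ids ASP.NET produces:

```python
from bs4 import BeautifulSoup

# Fabricated ASP.NET-style id; only its suffix is "lbDateTime"
html = '<span id="ctl00_gvMain_ctl02_lbDateTime">01/03/2018 17:11</span>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.findAll("span", {"id": "lbDateTime"}))  # [] -- exact match fails

# The "L and ..." guard skips spans that have no id attribute at all
dates = soup.findAll("span", {"id": lambda L: L and L.endswith('lbDateTime')})
print([d.text for d in dates])  # ['01/03/2018 17:11']
```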
(来源:https://stackoverflow.com/a/14257743/942692)
If you are indeed getting the same response, you will need to fix your request. I'm not familiar with urllib, so I can't help you there, but if you are able to use the requests library instead, here's some code that works for me (dates returns a ResultSet object with 20 elements):
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,zh-TW;q=0.7,zh;q=0.6,zh-CN;q=0.5'}

session = requests.session()
response = session.get('http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx',
                       headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
soup = BeautifulSoup(response.content, 'html.parser')
form_data = {
    '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'}).get('value'),
    '__VIEWSTATEGENERATOR': soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value'),
    '__VIEWSTATEENCRYPTED': soup.find('input', {'name': '__VIEWSTATEENCRYPTED'}).get('value')
}
f = session.post('http://www.hkexnews.hk/listedco/listconews/advancedsearch/search_active_main_c.aspx',
                 data=form_data, headers=headers)
soup = BeautifulSoup(f.content, 'html.parser')
dates = soup.findAll("span", {"id": lambda L: L and L.endswith('lbDateTime')})
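Each element of dates is a bs4 Tag, so the visible timestamps come out with .text. Since the real rows only exist after the live POST above, this sketch uses fabricated sample markup in their place:

```python
from bs4 import BeautifulSoup

# Fabricated sample of result rows; the real page needs the live POST
sample = '''
<span id="ctl00_gvMain_ctl02_lbDateTime">01/03/2018 17:11</span>
<span id="ctl00_gvMain_ctl03_lbDateTime">01/03/2018 16:45</span>
'''
soup = BeautifulSoup(sample, 'html.parser')
dates = soup.findAll("span", {"id": lambda L: L and L.endswith('lbDateTime')})
for d in dates:
    print(d.text)  # one timestamp per matched span
```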