Python Requests/BeautifulSoup access to pagination
Question
I am trying to access different pages of a website to get a list of items (20 per page). There is one extra parameter to send to select the page, but somehow I am not able to pass it along properly - the parameter has to be sent in the body of the request. I tried with params and with data without any success. What is the proper method to add something to the "body" of a request?
Here is what I have. It gives me the first page six times.
import requests
from bs4 import BeautifulSoup
import time

def SP_STK_LIST(numpage):
    payload = {}
    for i in range(0, numpage):
        payload['ct100$Contenido$GoPag'] = i
        header = {'Content-Type': 'text/html; charset=utf-8'}
        req = requests.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers=header, params=payload)
        print(req.url)
        page = req.text
        soup = BeautifulSoup(page)
        table = soup.find("table", {"id": "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr"):
            print(row)
        time.sleep(1)

SP_STK_LIST(6)
I don't think I clearly understand the differences between 'data' and 'params', or even 'files', which I have seen but which does not seem (I think) to relate to my problem.
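For reference, the distinction can be seen with a short sketch against an echo service (httpbin.org here is purely illustrative): params is encoded into the URL query string, data is form-encoded into the request body (which is what an ASP.NET page expects), and files builds a multipart upload body.

import requests

# httpbin.org echoes the request back, so it is handy for seeing
# where each argument ends up.
r1 = requests.post("https://httpbin.org/post", params={"page": 2})
print(r1.url)             # https://httpbin.org/post?page=2 -> params go into the query string
print(r1.json()["form"])  # {} -> nothing in the body

r2 = requests.post("https://httpbin.org/post", data={"page": 2})
print(r2.url)             # https://httpbin.org/post -> URL unchanged
print(r2.json()["form"])  # {'page': '2'} -> data is form-encoded in the body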
First, I want to thank Selcuk for the quick answer. I managed to implement it on my system, but as jlaur pointed out, it is extremely slow, and despite the "headless" option a command box opens on the screen. Using jlaur's suggestion, I came up with this (still not working, but I am sure not much is missing):
import requests
from bs4 import BeautifulSoup
import time
import collections

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    for i in range(0, numpage):
        header = {'Content-Type': 'text/html; charset=utf-8'}
        ses = requests.session()
        req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", data=payload)
        print(req.request.body)
        page = req.text
        soup = BeautifulSoup(page)
        # find next __VIEWSTATE and __EVENTVALIDATION
        viewstate = soup.find("input", {"id": "__VIEWSTATE"})['value']
        print("VIEWSTATE: ", viewstate)
        eventval = soup.find("input", {"id": "__EVENTVALIDATION"})['value']
        print("EVENTVALIDATION: ", eventval)
        payload['__EVENTTARGET'] = ""
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id": "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",", "").replace("S.A.", ""))
        time.sleep(1)

SPAIN_STK_LIST(3)
Each page generates __VIEWSTATE and __EVENTVALIDATION values that are stored in hidden tags on the page. I use them to go to the next page. I also used a session as suggested, but this is still not working, even though the payload has exactly the same format (I used an OrderedDict) as the request body the web page sends. Any idea what is missing?
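One detail worth flagging: the table id used in the same code is ctl00_Contenido_tblEmisoras, so the payload key ct100$Contenido$GoPag is most likely a typo for ctl00$Contenido$GoPag - 'ctl00' with a lowercase L, not the digit 1. Below is a minimal sketch of the postback loop with that key corrected; it assumes the __VIEWSTATEGENERATOR value can be read from its hidden input instead of being hardcoded, and that posting the GoPag field alone is enough to trigger the page change.

import requests
from bs4 import BeautifulSoup

URL = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"

def spain_stk_list(numpage):
    ses = requests.session()
    resp = ses.get(URL)  # plain GET for page 1
    for page in range(1, numpage + 1):
        soup = BeautifulSoup(resp.text, "html.parser")
        table = soup.find("table", {"id": "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            print(row.findAll("td")[0].get_text(strip=True))
        if page == numpage:
            break
        # Echo the hidden ASP.NET fields back and ask for the next page.
        payload = {
            "__EVENTTARGET": "",
            "__EVENTARGUMENT": "",
            "__VIEWSTATE": soup.find("input", {"id": "__VIEWSTATE"})["value"],
            "__VIEWSTATEGENERATOR": soup.find("input", {"id": "__VIEWSTATEGENERATOR"})["value"],
            "__EVENTVALIDATION": soup.find("input", {"id": "__EVENTVALIDATION"})["value"],
            "ctl00$Contenido$GoPag": page + 1,  # 'ctl00', not 'ct100'
        }
        resp = ses.post(URL, data=payload)

spain_stk_list(3)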
Answer
I don't know if you can use Selenium, but if you are going to interact with the page, you should.
You can install it with pip: pip install selenium.
I used pandas only for visualisation purposes; you do not have to use it.
First, download the Chrome driver for Selenium from https://chromedriver.storage.googleapis.com/index.html?path=2.40/ and extract it to your workspace, or specify its location via the executable_path parameter. It's up to you.
This will fetch all the data in the table until there is no next page.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)
driver.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx")

next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
data = []
try:
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'ctl00_Contenido_tblEmisoras'})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols])
        # Wait for table to load
        time.sleep(2)
        next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
        next_button.click()
except NoSuchElementException:
    print('No more page to load')

df = pd.DataFrame(columns=('Name', 'Sector - Subsector', 'Market', 'Indices'), data=data)
print(df.mask(df.eq('None')).dropna())
The output is:
Name ... Indices
1 ABENGOA, S.A. ...
2 ABERTIS INFRAESTRUCTURAS, S.A. ...
3 ACCIONA,S.A. ... IBEX 35®, IBEX TOP Dividendo®
4 ACERINOX, S.A. ... IBEX 35®, IBEX TOP Dividendo®
5 ACS,ACTIVIDADES DE CONST.Y SERVICIOS S.A ... IBEX 35®, IBEX TOP Dividendo®
6 ADOLFO DOMINGUEZ, S.A. ...
7 ADVEO GROUP INTERNATIONAL, S.A. ...
8 AEDAS HOMES, S.A. ...
9 AENA, S.M.E., S.A. ... IBEX 35®
10 AIRBUS SE ...
11 ALANTRA PARTNERS, S.A. ...
12 ALFA, S.A.B. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
13 ALMIRALL, S.A. ...
14 AMADEUS IT GROUP, S.A. ... IBEX 35®
15 AMERICA MOVIL, S.A.B. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
16 AMPER, S.A. ...
17 APERAM, SOCIETE ANONYME ...
18 APPLUS SERVICES, S.A. ...
19 ARCELORMITTAL, S.A. ... IBEX 35®
20 ATRESMEDIA CORP. DE MEDIOS DE COM. S.A. ... IBEX TOP Dividendo®
21 AUDAX RENOVABLES, S.A. ...
22 AXIARE PATRIMONIO SOCIMI, S.A. ...
23 AYCO GRUPO INMOBILIARIO, S.A. ...
24 AZKOYEN S.A. ...
25 AZORA ALTUS, S.A. ...
27 BANCO BILBAO VIZCAYA ARGENTARIA, S.A. ... IBEX 35®, IBEX TOP Dividendo®
28 BANCO BRADESCO S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
29 BANCO DE SABADELL, S.A. ... IBEX 35®
30 BANCO SANTANDER RIO, S.A. ... FTSE Latibex All Share
31 BANCO SANTANDER, S.A. ... IBEX 35®, IBEX TOP Dividendo®
.. ... ... ...
145 RENTA 4 BANCO, S.A. ...
146 RENTA CORPORACION REAL ESTATE, S.A. ...
147 REPSOL, S.A. ... IBEX 35®, IBEX TOP Dividendo®
148 SACYR, S.A. ...
149 SAETA YIELD, S.A. ...
150 SARE HOLDING, S.A.B, DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
151 SERVICE POINT SOLUTIONS, S.A. ...
152 SIEMENS GAMESA RENEWABLE ENERGY, S.A. ... IBEX 35®
153 SNIACE, S.A. ...
154 SOLARIA ENERGIA Y MEDIO AMBIENTE, S.A. ...
155 TALGO, S.A. ...
157 TECNICAS REUNIDAS, S.A. ... IBEX 35®, IBEX TOP Dividendo®
158 TELEFONICA, S.A. ... IBEX 35®, IBEX TOP Dividendo®
159 TELEPIZZA GROUP, S.A. ...
160 TR HOTEL JARDIN DEL MAR, S.A. ...
161 TUBACEX, S.A. ...
162 TUBOS REUNIDOS,S.A. ...
163 TV AZTECA, S.A. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
164 UNICAJA BANCO, S.A. ...
165 UNION CATALANA DE VALORES, S.A. ...
166 URBAR INGENIEROS, S.A. ...
167 URBAS GRUPO FINANCIERO, S.A. ...
168 USINAS SIDERURGICAS DE MINAS GERAIS,S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
169 VALE, S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
170 VERTICE TRESCIENTOS SESENTA GRADOS, S.A. ...
171 VIDRALA S.A. ...
172 VISCOFAN, S.A. ... IBEX 35®
173 VOCENTO, S.A. ...
174 VOLCAN, COMPAñIA MINERA S.A.A. ... FTSE Latibex All Share
175 ZARDOYA OTIS, S.A. ...
[169 rows x 4 columns]