Python Requests/BeautifulSoup access to pagination


Question

I am trying to access different pages of a website to get a list of items (20 per page). There is one extra parameter to send to select the page, but somehow I am not able to pass it along properly: the parameter has to be sent in the body of the request. I tried with params and with data, without any success. What is the proper way to add something to the "body" of a request?

Here is what I have. It gives me the first page 6 times.

import requests
from bs4 import BeautifulSoup
import time

def SP_STK_LIST(numpage):
    payload = {}
    for i in range(0, numpage):
        payload['ct100$Contenido$GoPag'] = i
        header = {'Content-Type': 'text/html; charset=utf-8'}
        req = requests.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, params = payload)
        print(req.url)
        page = req.text
        soup = BeautifulSoup(page)            
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr"):
            print(row)
        time.sleep(1)

SP_STK_LIST(6)

I don't think I clearly understand the differences between 'data' and 'params', or even 'files', which I have seen but which does not seem (I think) to relate to my problem.
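To make the difference concrete, here is a minimal sketch that builds two POST requests without sending them and prints where the payload ends up. The URL and the field name are copied from the snippet above; everything else is illustrative. In Requests, `params=` is appended to the URL as a query string, `data=` is form-encoded into the request body, and `files=` is meant for multipart file uploads, so it is probably not relevant here.

```python
import requests

URL = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
payload = {"ct100$Contenido$GoPag": 1}  # field name copied from the question

# Build the requests without sending them, so nothing touches the network.
with_params = requests.Request("POST", URL, params=payload).prepare()
with_data = requests.Request("POST", URL, data=payload).prepare()

# params= ends up in the query string; the body stays empty.
print(with_params.url)
print(with_params.body)

# data= is form-encoded into the body, and Requests sets the Content-Type itself.
print(with_data.url)
print(with_data.body)
print(with_data.headers["Content-Type"])
```

Because Requests fills in the form Content-Type on its own when `data=` is a dict, the hand-written `'Content-Type': 'text/html; charset=utf-8'` header in the snippet above is likely unnecessary and may even confuse the server.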

First, I want to thank Selcuk for the quick answer. I managed to implement it on my system but, as jlaur pointed out, it is extremely slow and, despite the "headless" option, a command box still opens on the screen. Using jlaur's suggestion, I came up with this (still not working, but I am sure not much is missing):

import requests
from bs4 import BeautifulSoup
import time
import collections

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    for i in range(0, numpage):
        header = {'Content-Type': 'text/html; charset=utf-8'}
        ses = requests.session()
        req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", data = payload)
        print(req.request.body)
        page = req.text
        soup = BeautifulSoup(page)
        # find next __VIEWSTATE and __EVENTVALIDATION
        viewstate = soup.find("input", {"id":"__VIEWSTATE"})['value']
        print("VIEWSTATE: ", viewstate)
        eventval = soup.find("input", {"id":"__EVENTVALIDATION"})['value']
        print("EVENTVALIDATION: ", eventval)        
        payload['__EVENTTARGET'] = ""
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",","").replace("S.A.", ""))
        time.sleep(1)


SPAIN_STK_LIST(3)

Each page generates VIEWSTATE and EVENTVALIDATION values that are stored in hidden tags on the page. I use them to go to the next page. I also used a session as suggested, but this is still not working, even though the request body has the exact same format (I used an OrderedDict) as the one sent by the webpage. Any ideas what could be missing?
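As a rough sketch of the hidden-field approach described above, the helper below collects every hidden input from a page and adds the page number on top. It runs against an inline HTML sample rather than the live site. The helper names `collect_hidden_fields` and `build_page_payload` are hypothetical, and the default field name `ctl00$Contenido$GoPag` is an assumption: the question spells it `ct100`, while the table id on the same page starts with `ctl00`, so the spelling is worth checking against the real form.

```python
from bs4 import BeautifulSoup


def collect_hidden_fields(html):
    """Return {name: value} for every hidden <input> that has a name."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", {"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


def build_page_payload(html, page, page_field="ctl00$Contenido$GoPag"):
    """Build a POST body from the page's hidden fields plus the page number.

    page_field is an assumed spelling, not confirmed against the site.
    """
    payload = collect_hidden_fields(html)
    payload.setdefault("__EVENTTARGET", "")
    payload.setdefault("__EVENTARGUMENT", "")
    payload[page_field] = str(page)
    return payload


# Inline stand-in for a fetched page; the values are made up for illustration.
SAMPLE_HTML = """
<form method="post" action="Empresas.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="65A1DED9" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

payload = build_page_payload(SAMPLE_HTML, 2)
print(payload)
```

Two details in the snippet above may also matter: it creates a new session on every loop iteration, and its first POST carries an empty payload. A variant worth trying is to create the session once, fetch the first page with a GET, and then rebuild the payload from each response before the next POST. Whether the server accepts the page number this way is not confirmed here.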

Recommended answer

I don't know if you can use Selenium, but if you are going to interact with the page, you should.

You can install it with pip.

I used pandas only for visualisation purposes. You do not have to use it.

First, download the Chrome driver for Selenium from https://chromedriver.storage.googleapis.com/index.html?path=2.40/ and extract it to your workspace, or specify its location in the executable_path parameter. It's up to you.

This will get all the data in the table until there is no next page.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("headless")  # run Chrome without opening a browser window
# executable_path and find_element_by_id are Selenium 3 calls; later Selenium 4
# releases removed them in favour of a Service object and find_element(By.ID, ...).
driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)

driver.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx")
next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
data = []
try:
    while next_button:
        # Parse the table on the page that is currently loaded
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'ctl00_Contenido_tblEmisoras'})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            data.append([ele.text.strip() for ele in cols])
        # Wait for table to load
        time.sleep(2)
        # Raises NoSuchElementException on the last page, which ends the loop
        next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
        next_button.click()
except NoSuchElementException:
    print('No more page to load')
finally:
    driver.quit()  # close the headless browser and the driver process

df = pd.DataFrame(columns= ('Name','Sector - Subsector','Market','Indices'), data = data)

print(df.mask(df.eq('None')).dropna())

The output is:

 Name                        ...                                                                    Indices
1                               ABENGOA, S.A.                        ...
2              ABERTIS INFRAESTRUCTURAS, S.A.                        ...
3                                ACCIONA,S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
4                              ACERINOX, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
5    ACS,ACTIVIDADES DE CONST.Y SERVICIOS S.A                        ...                                              IBEX 35®, IBEX TOP Dividendo®
6                      ADOLFO DOMINGUEZ, S.A.                        ...
7             ADVEO GROUP INTERNATIONAL, S.A.                        ...
8                           AEDAS HOMES, S.A.                        ...
9                          AENA, S.M.E., S.A.                        ...                                                                   IBEX 35®
10                                  AIRBUS SE                        ...
11                     ALANTRA PARTNERS, S.A.                        ...
12                       ALFA, S.A.B. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
13                             ALMIRALL, S.A.                        ...
14                     AMADEUS IT GROUP, S.A.                        ...                                                                   IBEX 35®
15              AMERICA MOVIL, S.A.B. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
16                                AMPER, S.A.                        ...
17                    APERAM, SOCIETE ANONYME                        ...
18                      APPLUS SERVICES, S.A.                        ...
19                        ARCELORMITTAL, S.A.                        ...                                                                   IBEX 35®
20    ATRESMEDIA CORP. DE MEDIOS DE COM. S.A.                        ...                                                        IBEX TOP Dividendo®
21                     AUDAX RENOVABLES, S.A.                        ...
22             AXIARE PATRIMONIO SOCIMI, S.A.                        ...
23              AYCO GRUPO INMOBILIARIO, S.A.                        ...
24                               AZKOYEN S.A.                        ...
25                          AZORA ALTUS, S.A.                        ...
27      BANCO BILBAO VIZCAYA ARGENTARIA, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
28                        BANCO BRADESCO S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
29                    BANCO DE SABADELL, S.A.                        ...                                                                   IBEX 35®
30                  BANCO SANTANDER RIO, S.A.                        ...                                                     FTSE Latibex All Share
31                      BANCO SANTANDER, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
..                                        ...                        ...                                                                        ...
145                       RENTA 4 BANCO, S.A.                        ...
146       RENTA CORPORACION REAL ESTATE, S.A.                        ...
147                              REPSOL, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
148                               SACYR, S.A.                        ...
149                         SAETA YIELD, S.A.                        ...
150              SARE HOLDING, S.A.B, DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
151             SERVICE POINT SOLUTIONS, S.A.                        ...
152     SIEMENS GAMESA RENEWABLE ENERGY, S.A.                        ...                                                                   IBEX 35®
153                              SNIACE, S.A.                        ...
154    SOLARIA ENERGIA Y MEDIO AMBIENTE, S.A.                        ...
155                               TALGO, S.A.                        ...
157                   TECNICAS REUNIDAS, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
158                          TELEFONICA, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
159                     TELEPIZZA GROUP, S.A.                        ...
160             TR HOTEL JARDIN DEL MAR, S.A.                        ...
161                             TUBACEX, S.A.                        ...
162                       TUBOS REUNIDOS,S.A.                        ...
163                   TV AZTECA, S.A. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
164                       UNICAJA BANCO, S.A.                        ...
165           UNION CATALANA DE VALORES, S.A.                        ...
166                    URBAR INGENIEROS, S.A.                        ...
167              URBAS GRUPO FINANCIERO, S.A.                        ...
168  USINAS SIDERURGICAS DE MINAS GERAIS,S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
169                                VALE, S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
170  VERTICE TRESCIENTOS SESENTA GRADOS, S.A.                        ...
171                              VIDRALA S.A.                        ...
172                            VISCOFAN, S.A.                        ...                                                                   IBEX 35®
173                             VOCENTO, S.A.                        ...
174            VOLCAN, COMPAñIA MINERA S.A.A.                        ...                                                     FTSE Latibex All Share
175                        ZARDOYA OTIS, S.A.                        ...

[169 rows x 4 columns]
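The index in the output above skips some values (for example 26 and 156), which is consistent with rows that contain no `td` cells, probably header rows, being collected as empty lists and then removed by the `mask`/`dropna` step. That reading is a guess from the output, not something the answer states. Under that assumption, a pandas-free sketch could drop those rows before building the table; the helper name `keep_data_rows` and the placeholder cell values are made up for illustration.

```python
EXPECTED_COLUMNS = 4  # Name, Sector - Subsector, Market, Indices


def keep_data_rows(rows, width=EXPECTED_COLUMNS):
    """Keep rows that have the expected number of cells and a non-empty name."""
    return [row for row in rows if len(row) == width and row[0]]


# Stand-in for the scraped `data` list; empty lists mimic rows without <td> cells.
scraped = [
    [],
    ["ABENGOA, S.A.", "sector placeholder", "market placeholder", ""],
    [],
    ["ACCIONA,S.A.", "sector placeholder", "market placeholder",
     "IBEX 35®, IBEX TOP Dividendo®"],
]

clean = keep_data_rows(scraped)
print(len(clean))
for row in clean:
    print(row[0])
```

Filtering on row width keeps companies whose Indices cell is empty, since only rows without any cells are discarded.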
