My script doesn't seem to go to the next pages and doesn't scrape all the data I would like


Problem description


Here's my script (I didn't include all the code for the sake of clarity, but I will explain some aspects in detail):

from selenium import webdriver
import time   
from selenium.webdriver.support.select import Select    
from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC   
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np    
from selenium.webdriver.common.keys import Keys

#chemin du folder ou vous avez placer votre chromedriver
PATH = "driver\chromedriver.exe"

options = webdriver.ChromeOptions() 
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')

j = 3
for url, name in zip(urlfinal, nameshotel) :

    commspos = []
    commsneg = []
    header = []
    notes = []
    dates = []
    datestostay = []

    driver.get(url)

    results = requests.get(url, headers = headers)

    soup = BeautifulSoup(results.text, "html.parser")

    reviews = soup.find_all('li', class_ = "review_list_new_item_block")

    for k in range(j): #iterate over n pages


        for review in reviews:
            try:
                commpos  = review.find("div", class_  = "c-review__row").text[11:].strip()
            except:
                commpos = 'NA'

            commspos.append(commpos)



            try:
                commneg  = review.find("div", class_  = "c-review__row lalala").text[17:].strip()
            except:
                commneg = 'NA'

            commsneg.append(commneg)


            #head = review.find('div', class_ = ' c-review-block__title c-review__title--ltr  ').text.strip()
            #header.append(head)


            note = review.find('div', class_ = 'bui-review-score__badge').text.strip()
            notes.append(note)


            date = review.find('span', class_ = 'c-review-block__date').text.strip()
            dates.append(date)


            try:
                datestay = review.find('ul', class_ = 'bui-list bui-list--text bui-list--icon bui_font_caption c-review-block__row c-review-block__stay-date').text[16:].strip()
                datestostay.append(datestay)
            except:
                datestostay.append('NaN')

            time.sleep(3)

        nextpages = driver.find_element_by_xpath('//a[@class="pagenext"]')

        urlnext = nextpages.get_attribute("href")

        results2 = requests.get(urlnext)

        driver.get(urlnext)

        time.sleep(3)

        soup = BeautifulSoup(results2.text, "html.parser")

        reviews = soup.find_all('li', class_ = "review_list_new_item_block")


    data = pd.DataFrame({
        'commspos' : commspos,
        'commsneg' : commsneg,
        #'headers' : header,
        'notes' : notes,
        'dates' : dates,
        'datestostay' : datestostay,
        })

    data.to_csv(f"{name}.csv", sep=';', index=False, encoding = 'utf_8_sig')
    #data.to_csv(f"{name} + datetime.now().strftime("_%Y_%m_%d-%I_%M_%S").csv", sep=';', index=False)

    time.sleep(3)

A list of hotel links is stored in urlfinal, like this one for example: link

And nameshotel is just a list of hotel names, used only to name the CSV files, so it doesn't matter too much.

I cannot figure out why, but this part doesn't seem to work:

nextpages = driver.find_element_by_xpath('//a[@class="pagenext"]')

urlnext = nextpages.get_attribute("href")

results2 = requests.get(urlnext)

driver.get(urlnext)

time.sleep(3)

soup = BeautifulSoup(results2.text, "html.parser")

reviews = soup.find_all('li', class_ = "review_list_new_item_block")

My script scrapes only 20 comments for each link, although there are 25 comments on each page.

Yet it seems that this href goes to the next page, and I implemented this href in a loop as you can see above.

Any ideas why it doesn't work as intended?

Solution

You need to change your URL: requesting the review-list endpoint with the rows=25 parameter returns all the rows in the HTML:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

# The review-list endpoint returns the reviews directly; rows=25 asks for 25 per page
res = requests.get("https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset=0&rows=25", headers=headers)

soup = BeautifulSoup(res.text, "html.parser")

reviews = soup.find_all('li', class_="review_list_new_item_block")

Output:

len(reviews)

25
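
As a quick check, these li elements already contain the fields the original script collects. A minimal sketch, reusing the class names from the question's code (taken from the question, not verified against the current page):

# Print the score, review date and positive comment for the first few reviews,
# using the class names taken from the question's script
for review in reviews[:3]:
    score = review.find('div', class_='bui-review-score__badge')
    date = review.find('span', class_='c-review-block__date')
    positive = review.find('div', class_='c-review__row')
    print(
        score.text.strip() if score else 'NA',
        date.text.strip() if date else 'NA',
        (positive.text.strip() if positive else 'NA')[:60],
    )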

The code above is for one page, but the code below covers all 61 pages: I find the first-page and last-page offset values and, based on them, extract the reviews.

import requests
from bs4 import BeautifulSoup


def find_page_val():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    res = requests.get("https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset=0&rows=25", headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    # Offset of the second page, i.e. the step between pages
    first = int(soup.find("div", class_="bui-pagination__pages").find_all("div", class_="bui-pagination__item")[1].find("a")['href'].split("=")[-1])
    # Offset of the last page
    last = int(soup.find("div", class_="bui-pagination__pages").find_all("div", class_="bui-pagination__item")[-1].find("a")['href'].split("=")[-1])
    # Upper bound for range() so that the last offset is still included
    total = first + last
    return first, total


def connection(i):
    # Fetch one page of reviews at the given offset and return the parsed soup
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    res = requests.get(f"https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset={i}&rows=25", headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    return soup


def get_list_reviews():
    first, total = find_page_val()
    # Walk through every page by stepping the offset parameter
    for i in range(0, total, first):
        soup = connection(i)
        reviews = soup.find_all('li', class_="review_list_new_item_block")
        print(len(reviews))

Finally, calling get_list_reviews() gives this output.

Output:

25
25
25
...
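
To tie this back to the original goal, here is a minimal sketch of feeding this offset-based pagination into the same DataFrame/CSV step as the original script. It reuses find_page_val() and connection() from above (which are hard-coded to the hotelistria review-list URL, so each hotel in urlfinal would need its own URL), and the class names are carried over from the question; they may have changed on Booking.com's side. The prefix slicing from the original script (text[11:], text[17:]) is left out here.

import pandas as pd


def text_or_na(review, tag, cls):
    # Return the stripped text of the first matching element, or 'NA' if it is missing
    found = review.find(tag, class_=cls)
    return found.text.strip() if found is not None else 'NA'


def scrape_hotel_reviews(csv_name):
    rows = []
    first, total = find_page_val()
    for i in range(0, total, first):
        soup = connection(i)
        for review in soup.find_all('li', class_="review_list_new_item_block"):
            rows.append({
                'commspos': text_or_na(review, 'div', "c-review__row"),
                'commsneg': text_or_na(review, 'div', "c-review__row lalala"),
                'notes': text_or_na(review, 'div', "bui-review-score__badge"),
                'dates': text_or_na(review, 'span', "c-review-block__date"),
                'datestostay': text_or_na(review, 'ul', "c-review-block__stay-date"),
            })
    data = pd.DataFrame(rows)
    data.to_csv(f"{csv_name}.csv", sep=';', index=False, encoding='utf_8_sig')


scrape_hotel_reviews("hotelistria")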
