My script doesn't seem to go to the next pages and doesn't scrape all the data I would like
Problem Description
Here's my script (I didn't include all the code for the sake of clarity, but I will explain some aspects in detail):
from selenium import webdriver
import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium.webdriver.common.keys import Keys

# path to the folder where you placed your chromedriver
PATH = r"driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')

j = 3

for url, name in zip(urlfinal, nameshotel):
    commspos = []
    commsneg = []
    header = []
    notes = []
    dates = []
    datestostay = []

    driver.get(url)
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    reviews = soup.find_all('li', class_="review_list_new_item_block")

    for k in range(j):  # iterate over n pages
        for review in reviews:
            try:
                commpos = review.find("div", class_="c-review__row").text[11:].strip()
            except:
                commpos = 'NA'
            commspos.append(commpos)
            try:
                commneg = review.find("div", class_="c-review__row lalala").text[17:].strip()
            except:
                commneg = 'NA'
            commsneg.append(commneg)
            #head = review.find('div', class_=' c-review-block__title c-review__title--ltr ').text.strip()
            #header.append(head)
            note = review.find('div', class_='bui-review-score__badge').text.strip()
            notes.append(note)
            date = review.find('span', class_='c-review-block__date').text.strip()
            dates.append(date)
            try:
                datestay = review.find('ul', class_='bui-list bui-list--text bui-list--icon bui_font_caption c-review-block__row c-review-block__stay-date').text[16:].strip()
                datestostay.append(datestay)
            except:
                datestostay.append('NaN')

        time.sleep(3)
        nextpages = driver.find_element_by_xpath('//a[@class="pagenext"]')
        urlnext = nextpages.get_attribute("href")
        results2 = requests.get(urlnext)
        driver.get(urlnext)
        time.sleep(3)
        soup = BeautifulSoup(results2.text, "html.parser")
        reviews = soup.find_all('li', class_="review_list_new_item_block")

    data = pd.DataFrame({
        'commspos': commspos,
        'commsneg': commsneg,
        #'headers': header,
        'notes': notes,
        'dates': dates,
        'datestostay': datestostay,
    })
    data.to_csv(f"{name}.csv", sep=';', index=False, encoding='utf_8_sig')
    #data.to_csv(f"{name} + datetime.now().strftime("_%Y_%m_%d-%I_%M_%S").csv", sep=';', index=False)
    time.sleep(3)
A list of hotel links is stored in urlfinal, like this one for example: link
And nameshotel is just a list of names; it is used only to name the CSV files, so it doesn't matter too much.
I cannot figure out why, but this part doesn't seem to work:
nextpages = driver.find_element_by_xpath('//a[@class="pagenext"]')
urlnext = nextpages.get_attribute("href")
results2 = requests.get(urlnext)
driver.get(urlnext)
time.sleep(3)
soup = BeautifulSoup(results2.text, "html.parser")
reviews = soup.find_all('li', class_ = "review_list_new_item_block")
My script scrapes only 20 comments for each link, although there are 25 comments on each page:
Yet it seems that this href goes to the next page, and I implemented this href in a loop as you can see above:
Any ideas why it doesn't work as intended?
You need to change your URL: the rows=25 part of the review-list URL gets all the rows in the HTML:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
res = requests.get("https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset=0&rows=25", headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
reviews = soup.find_all('li', class_="review_list_new_item_block")
Output:
len(reviews)
25
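Once you have the `reviews` list, you can pull the individual fields out of each `li` the same way the question's script does. A minimal sketch, run against a hypothetical, simplified HTML fragment (the class names come from the question; real Booking.com markup is more complex, so treat this as an illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified fragment mimicking the review-list markup.
html = """
<ul>
  <li class="review_list_new_item_block">
    <div class="bui-review-score__badge">8.5</div>
    <span class="c-review-block__date">12 July 2021</span>
  </li>
  <li class="review_list_new_item_block">
    <div class="bui-review-score__badge">9.0</div>
    <span class="c-review-block__date">3 June 2021</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for review in soup.find_all("li", class_="review_list_new_item_block"):
    note = review.find("div", class_="bui-review-score__badge")
    date = review.find("span", class_="c-review-block__date")
    # Guard with `if ... else` instead of a bare try/except, so a missing
    # field yields 'NA' without hiding unrelated errors.
    records.append({
        "note": note.text.strip() if note else "NA",
        "date": date.text.strip() if date else "NA",
    })

print(records)
```

The `if note else "NA"` guard replaces the bare `except:` clauses from the original script, which can silently swallow any error, not just a missing element.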
The code above is for one page, but the code below is for all 61 pages: I find the first-page and last-page offset values and extract the reviews based on them.
import requests
from bs4 import BeautifulSoup

def find_page_val():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    res = requests.get("https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset=0&rows=25", headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    # Offset of page 2 (i.e. the page size) and offset of the last page,
    # read from the pagination links.
    first = int(soup.find("div", class_="bui-pagination__pages").find_all("div", class_="bui-pagination__item")[1].find("a")['href'].split("=")[-1])
    last = int(soup.find("div", class_="bui-pagination__pages").find_all("div", class_="bui-pagination__item")[-1].find("a")['href'].split("=")[-1])
    total = first + last
    return first, total

def connection(i):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    res = requests.get(f"https://www.booking.com/reviewlist.fr.html?cc1=fr&dist=1&pagename=hotelistria&type=total&offset={i}&rows=25", headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    return soup

def get_list_reviews():
    first, total = find_page_val()
    for i in range(0, total, first):
        soup = connection(i)
        reviews = soup.find_all('li', class_="review_list_new_item_block")
        print(len(reviews))
Finally, call get_list_reviews(); it gives this output:
Output:
25
25
25
...
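The offset arithmetic in `get_list_reviews()` can be checked in isolation. A sketch with hypothetical values (rows=25 per page, a last-page offset of 1500, which is what would produce the 61 pages mentioned above):

```python
# Offset arithmetic used by get_list_reviews(), with hypothetical values:
# `first` is the offset of page 2 (i.e. the page size), `last` is the
# offset of the final page, both read from the pagination links.
first = 25
last = 1500
total = first + last

# One request per page: offsets 0, 25, 50, ..., 1500.
offsets = list(range(0, total, first))
print(len(offsets))   # 61 pages
print(offsets[:3])    # [0, 25, 50]
print(offsets[-1])    # 1500
```

Stepping by `first` up to (but not including) `first + last` is what guarantees the final page at offset `last` is still requested while no offset beyond it is.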