Scraping contents of multiple web pages of a website using BeautifulSoup and Selenium
Problem Description
The website I want to scrape is:
http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061
I want to get the last page number of the above link before proceeding, which is 499 at the time of the screenshot.
My code:
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
from selenium import webdriver;import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
firefox_capabilities['binary'] = '/etc/firefox'
driver = webdriver.Firefox(capabilities=firefox_capabilities)
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
wait = WebDriverWait(driver, 10)
soup=BeautifulSoup(driver.page_source,"lxml")
containers = soup.findAll("ul",{"class":"pages table"})
containers[0] = soup.findAll("li")
li_len = len(containers[0])
for item in soup.find("ul",{"class":"pages table"}) :
    li_text = item.select("li")[li_len].text
    print("li_text : {}\n".format(li_text))
driver.quit()
I need help to figure out the error in my code for getting the last page number. Also, I would be grateful if someone could give an alternate solution for the same and suggest ways to achieve my intention.
Answer
If you want to get the last page number of the above link for proceeding, which is 499, you can use either Selenium or BeautifulSoup as follows:
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
driver.get(url)
# Scroll the pagination block into view so it is rendered before reading it
element = driver.find_element_by_xpath("//div[@class='row pagination']//p/span[contains(.,'Reviews on Reliance Jio')]")
driver.execute_script("return arguments[0].scrollIntoView(true);", element)
# The last <li> inside the pagination list holds the highest page number
print(driver.find_element_by_xpath("//ul[@class='pagination table']/li/ul[@class='pages table']//li[last()]/a").get_attribute("innerHTML"))
driver.quit()
Console output:
499
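As a side note, the find_element_by_xpath helpers were removed in Selenium 4. A minimal sketch of the same lookup using the current By-based API with an explicit wait, assuming geckodriver is available on the PATH and the page markup is the same as above, could look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # assumes geckodriver is on the PATH
driver.get("http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061")
# Wait until the pagination list is present, then read the text of its last <li>
last_li = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
    (By.XPATH, "//ul[@class='pages table']/li[last()]/a")))
print(last_li.get_attribute("innerHTML"))
driver.quit()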
BeautifulSoup:
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
container = page_soup.find("ul",{"class":"pages table"})
all_li = container.findAll("li")
last_div = None
for last_div in all_li:
    pass
if last_div:
    content = last_div.getText()
    print(content)
Console output:
499
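The loop-with-pass above simply walks to the final <li>; an equivalent, slightly shorter sketch (assuming the same "pages table" markup) indexes the list directly:

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

url = "http://www.mouthshut.com/mobile-operators/Reliance-Jio-reviews-925812061"
page_soup = BeautifulSoup(uReq(url).read(), "html.parser")
# find_all("li")[-1] returns the last <li>, i.e. the highest page number
last_li = page_soup.find("ul", {"class": "pages table"}).find_all("li")[-1]
print(last_li.get_text())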