Blocking login overlay window when scraping web page using Selenium
Question
I am trying to scrape a long list of books spread across 10 web pages. When the loop clicks the next > button for the first time, the website displays a login overlay, so Selenium cannot find the target elements. I have tried all the possible solutions:
- Use some Chrome options.
- Use try-except to click the X button on the overlay. The overlay appears only once (when next > is clicked for the first time). The problem is that when I put this try-except block at the end of the while True: loop, the loop becomes infinite, because I use continue in the except branch (I do not want to break out of the loop).
- Add some popup-blocker extensions to Chrome, but they do not work when I run the code, even though I add the extension using options.add_argument('load-extension=' + ExtensionPath).
Here is my code:
# Imports added for completeness
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('disable-avfoundation-overlays')
options.add_argument('disable-internal-flash')
options.add_argument('no-proxy-server')
options.add_argument("disable-notifications")
options.add_argument("disable-popup")
Extension = (r'C:\Users\DELL\AppData\Local\Google\Chrome\User Data\Profile 1\Extensions\ifnkdbpmgkdbfklnbfidaackdenlmhgh\1.1.9_0')
options.add_argument('load-extension=' + Extension)
options.add_argument('--disable-overlay-scrollbar')
driver = webdriver.Chrome(options=options)
driver.get('https://www.goodreads.com/list/show/32339._50_?page=')
wait = WebDriverWait(driver, 2)
review_dict = {'title': [], 'author': [], 'rating': []}
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
prod_containers = html_soup.find_all('table', class_='tableList js-dataTooltip')
while True:
    table = driver.find_element_by_xpath('//*[@id="all_votes"]/table')
    for product in table.find_elements_by_xpath(".//tr"):
        for td in product.find_elements_by_xpath('.//td[3]/a'):
            title = td.text
            review_dict['title'].append(title)
        for td in product.find_elements_by_xpath('.//td[3]/span[2]'):
            author = td.text
            review_dict['author'].append(author)
        for td in product.find_elements_by_xpath('.//td[3]/div[1]'):
            rating = td.text[0:4]
            review_dict['rating'].append(rating)
    try:
        close = wait.until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[3]/div/div/div[1]/button')))
        close.click()
    except NoSuchElementException:
        continue
    try:
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
    except TimeoutException:
        break
df = pd.DataFrame.from_dict(review_dict)
df
Any help would be appreciated: for example, whether I can change the while loop to a for loop that clicks the next > button until the end, where I should put the try-except block that closes the overlay, or whether there is a Chrome option that can disable the overlay. Thanks in advance.
Answer
Thank you for sharing your code and the website that you are having trouble with. I was able to close the login modal by using XPath. I took this challenge and broke the code up using class objects: one object for the selenium.webdriver.chrome.webdriver, and the other for the page that you wanted to scrape the data from (https://www.goodreads.com/list/show/32339).

In the following methods I used the JavaScript call return arguments[0].scrollIntoView(); and was able to scroll to the last book displayed on the page. After that, I could click the next button:
def scroll_to_element(self, xpath: str):
    element = self.chrome_driver.find_element(By.XPATH, xpath)
    self.chrome_driver.execute_script("return arguments[0].scrollIntoView();", element)

def get_book_count(self):
    return len(self.chrome_driver.find_elements(
        By.XPATH, "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr"))

def click_next_page(self):
    # Scroll to the last record and click "next page"
    xpath = "//div[@id='all_votes']//table[contains(@class, 'tableList')]//tbody//tr[{0}]".format(self.get_book_count())
    self.scroll_to_element(xpath)
    self.chrome_driver.find_element(By.XPATH, "//div[@id='all_votes']//div[@class='pagination']//a[@class='next_page']").click()
Once I clicked the "Next" button, I saw the modal display. I was able to find the XPath for the modal and close it:
# DriverWait and DriverConditions are aliases for selenium's WebDriverWait
# and expected_conditions
def is_displayed(self, xpath: str, timeout: int = 5):
    try:
        web_element = DriverWait(self.chrome_driver, timeout).until(
            DriverConditions.presence_of_element_located(locator=(By.XPATH, xpath))
        )
        return web_element is not None
    except TimeoutException:
        return False

def is_modal_displayed(self):
    return self.is_displayed("//body[@class='modalOpened']")

def close_modal(self):
    self.chrome_driver.find_element(By.XPATH, "//div[@class='modal__content']//div[@class='modal__close']").click()
    if self.is_modal_displayed():
        raise Exception("Modal Failed To Close")
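Putting the pieces together, these methods can be driven by a bounded loop. This is only a sketch: it assumes a page object exposing click_next_page, is_modal_displayed, and close_modal as above, and it leaves out the per-page data extraction:

```python
def scrape_pages(page, max_pages=10):
    """Advance through the paginated list, closing the login modal
    whenever it appears.

    `page` is assumed to expose click_next_page(), is_modal_displayed(),
    and close_modal(). A 10-page list needs only 9 clicks of "next",
    hence max_pages - 1 iterations. The extraction step for each page
    would go at the top of the loop body.
    """
    for _ in range(max_pages - 1):
        page.click_next_page()
        if page.is_modal_displayed():
            page.close_modal()
```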
I hope this helps you solve your problem.
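On the asker's side question about replacing the while loop with a for loop: since the list URL already takes a ?page= parameter, the pagination could also be driven through the URL itself, visiting each page with driver.get() instead of clicking next >. The overlay is reportedly triggered by the click, so direct navigation may avoid it altogether, though that is untested here. A sketch, assuming the page count is known in advance:

```python
def page_urls(base_url, last_page):
    """Build the URL for each page of the list: base_url?page=1 .. ?page=N.

    `base_url` and `last_page` are placeholders; for the list in the
    question they would be the Goodreads list URL and 10. Each URL can
    then be passed to driver.get() in a plain for loop.
    """
    return ["{0}?page={1}".format(base_url, n) for n in range(1, last_page + 1)]
```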