Python Webscraping Selenium and BeautifulSoup (Modal window content)


Problem description


I am trying to learn web scraping (I am a total novice). I noticed that on some websites (e.g. Quora), when I click a button, a new element comes up on screen, but I cannot seem to get the page source of that new element. I want to be able to get the page source of the new popup and all of its elements. Note that you need to have a Quora account in order to reproduce my problem.

Here is part of the code, which uses BeautifulSoup, Selenium and chromedriver:

from selenium import webdriver
from selenium.webdriver.common.by import By  # needed for find_element(By.XPATH, ...)
from bs4 import BeautifulSoup
from unidecode import unidecode
import time

sleep = 10
USER_NAME = 'Insert Account name'  # Insert Account name here
PASS_WORD = 'Insert Account Password'  # Insert Account Password here
url = 'Insert url'
url2 = ['insert url']

# Log in to your account
driver = webdriver.Chrome('INSERT PATH TO CHROME DRIVER')
driver.get(url)
page_source = driver.page_source
if 'Continue With Email' in page_source:
    try:
        username = driver.find_element(By.XPATH, '//input[@placeholder="Email"]')
        password = driver.find_element(By.XPATH, '//input[@placeholder="Password"]')
        login = driver.find_element(By.XPATH, '//input[@value="Login"]')
        username.send_keys(USER_NAME)
        password.send_keys(PASS_WORD)
        time.sleep(sleep)
        login.click()
        time.sleep(sleep)
    except Exception:
        print('Did not work :( .. Try again')
else:
    print('Did not work :( .. Try different page')




The next part goes to the webpage in question and ("tries to") collect information about the followers of a particular question.

for url1 in url2:
    driver.get(url1)
    source = driver.page_source
    soup1 = BeautifulSoup(source, "lxml")
    follower_button = soup1.find('a', {'class': 'FollowerListModalLink QuestionFollowerListModalLink'})
    follower_button_text = unidecode(follower_button.text)
    driver.find_element(By.LINK_TEXT, follower_button_text).click()

    #### Does not give me the correct page source in the next line ####
    source2 = driver.page_source
    soup2 = BeautifulSoup(source2, "lxml")

    follower_list = soup2.findAll('div', {'class': 'FollowerListModal QuestionFollowerListModal Modal'})
    if len(follower_list) > 0:
        print('It worked :)')
    else:
        print('Did not work :(')


However, when I try to get the page source of the followers element, I end up getting the page source of the main page rather than the follower element. Can anyone help me get the page source of the follower element that pops up? What am I not getting here?


NOTE: Another way of recreating or looking at my problem is to log in to your Quora account (if you have one) and then go to any question with followers. If you click the followers button on the lower right side of the screen, that will result in a popup. My problem is essentially to get the elements of this popup.



Update - Okay, so I have been reading a bit, and it seems the window is a modal window. Can anyone help me get the contents of a modal window?

Answer


Problem resolved. All I had to do was to add one line:

time.sleep(sleep)


after generating the click. The problem was that, with no wait time initially, the page source was not getting updated. With a sufficiently long time.sleep (the required duration may vary from website to website), the page source finally got updated and I was able to get the required elements. :) Lesson learned: patience is the key to web scraping. I spent the entire day trying to figure this out.
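A fixed sleep works, but it either wastes time or can still race the page on a slow connection. A more robust pattern is to poll until a condition holds, which is what Selenium's `WebDriverWait(...).until(...)` with `expected_conditions` implements. As a minimal sketch of that idea in plain Python (the `condition` callback stands in for a check like "the modal element is present in the page source"; names and defaults here are illustrative, not from the original answer):

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value from `condition`, or raises TimeoutError --
    the same contract as Selenium's WebDriverWait(driver, timeout).until(...).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(poll_interval)
```

With Selenium itself the equivalent would be something like `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "Modal")))`, which stops waiting as soon as the modal appears instead of always sleeping the full interval.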

