Scrapy with Selenium does not detect HTML element loaded dynamically


Problem description


I am using Scrapy with Selenium to scrape content from this page: https://nikmikk.itch.io/door-knocker

In it, there is a table under the div with class .game_info_panel_widget, whose first row ("Published 62 days ago") seems to be loaded dynamically.

I tried fetching the page the way Scrapy sees it, but I cannot find that row in the HTML.

scrapy fetch --nolog https://nikmikk.itch.io/door-knocker > test.html

Here is what I see in test.html. The first table row is Status, not the Published row that I see when I view the page source directly in Chrome.

<div class="game_info_panel_widget">
    <table>
        <tbody>
            <tr>
                <td>Status</td>
                <td>Prototype</td>
                ...
            </tr>
            ...
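One way to confirm that the row really is missing from the static HTML, rather than just hard to spot, is to list every table cell in the fetched file. The sketch below uses only the standard library; `STATIC_HTML` is a hypothetical stand-in for the contents of test.html, and `CellCollector` and `cell_texts` are illustrative names, not part of Scrapy or Selenium.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in the order it appears."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a <td>.
        if self.in_cell:
            self.cells[-1] += data


def cell_texts(html):
    """Return the stripped text of every <td> found in the given HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical stand-in for what `scrapy fetch` wrote to test.html.
STATIC_HTML = """
<div class="game_info_panel_widget">
  <table><tbody>
    <tr><td>Status</td><td>Prototype</td></tr>
  </tbody></table>
</div>
"""

labels = cell_texts(STATIC_HTML)
print(labels)
print("Published" in labels)
```

To run it against the real file, pass the file contents instead, for example `cell_texts(open("test.html", encoding="utf-8").read())`. If "Published" is absent there but present in the browser, the row is most likely added by JavaScript after the initial response.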

In my class SpiderDownloaderMiddleware, I have included Selenium:

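from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1200x600')

# The original used chrome_options=options, the older Selenium 3 keyword.
driver = webdriver.Chrome(options=options)

class SpiderDownloaderMiddleware(object):
    # Other code omitted
    def process_request(self, request, spider):
        driver.get(request.url)

        # This only waits for the panel container, not for the Published row.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".game_info_panel_widget"))
        )

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8-sig', request=request)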

How can I check how that row is loaded, and how can I scrape that information?

Update: I followed @Yosuva A's answer below and got something like this:

 9 days ago

In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English

But the output is inconsistent: sometimes it includes the desired rows and sometimes it doesn't. I guess this is because Selenium waits for a generic td element, and any td in the panel satisfies that condition:

"//div[@class='game_info_panel_widget']//table//tr//td"

I tried changing it to use td[@text='Published'], but Selenium times out.
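A likely reason for the timeout is that `@text` in XPath refers to an attribute named "text", and a td element has no such attribute, so the predicate never matches. The sketch below illustrates the difference on a hypothetical, simplified copy of the panel markup using the standard library's `xml.etree.ElementTree` (Python 3.7+), which supports only a subset of XPath. In Selenium's XPath 1.0 the text comparison would instead be written as `//td[text()='Published']` or `//td[normalize-space()='Published']`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified copy of the info panel markup (not fetched from the site).
SNIPPET = """
<div class="game_info_panel_widget">
  <table>
    <tr><td>Published</td><td>9 days ago</td></tr>
    <tr><td>Status</td><td>In development</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET)

# [@text='Published'] looks for an attribute called "text"; no <td> has one.
by_attribute = root.findall(".//td[@text='Published']")

# [.='Published'] compares the element's text content instead.
by_text = root.findall(".//td[.='Published']")

print(len(by_attribute))  # 0
print(len(by_text))       # 1
```

The attribute form finds nothing even though the cell is present, which matches the timeout described above.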

My code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver')  # Optional argument, if not specified will search path.
driver.implicitly_wait(15)

driver.get("https://thehive.itch.io/promnesia")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")))  # Wait for specific element

table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")

for rows in table_rows:
    print(rows.text)

driver.quit()

Is there any other way?

Conclusion: it works if we call time.sleep(2) after click(), as suggested by Yosuva A.

Solution

Please let me know whether this helps.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 3 style: the driver path is an optional argument; if not specified, PATH is searched.
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.implicitly_wait(15)

driver.get("https://thehive.itch.io/promnesia")
# Click the toggle button for the info panel.
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
time.sleep(2)  # Give the panel time to finish loading after the click
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")))  # Wait for specific element

table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")

# Each item is a single <td> cell, so labels and values print on alternating lines.
for rows in table_rows:
    print(rows.text)

driver.quit()

Output

Updated
1 day ago
Published
9 days ago
Status
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English
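A fixed `time.sleep(2)` can still be too short on a slow connection and is wasted time on a fast one. A possible alternative, sketched here under assumptions, is to poll until a cell whose text is "Published" actually exists. `published_row_loaded`, `wait_until`, `FakeCell`, and `FakeDriver` are hypothetical names introduced for illustration; the fake driver only simulates a row that appears late, so the example runs without a browser.

```python
import time

PANEL_CELLS_XPATH = "//div[@class='game_info_panel_widget']//table//tr//td"


def published_row_loaded(driver):
    """Return all cell texts once a 'Published' cell exists, otherwise False."""
    cells = driver.find_elements("xpath", PANEL_CELLS_XPATH)
    texts = [cell.text.strip() for cell in cells]
    return texts if "Published" in texts else False


def wait_until(driver, condition, timeout=10.0, poll=0.1):
    """Call condition(driver) repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


class FakeCell:
    """Stand-in for a WebElement; only the .text attribute is modelled."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Stand-in for a WebDriver whose 'Published' row appears on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        rows = ["Status", "In development"]
        if self.calls >= 3:
            rows = ["Published", "9 days ago"] + rows
        return [FakeCell(text) for text in rows]


driver = FakeDriver()
texts = wait_until(driver, published_row_loaded, timeout=2.0, poll=0.01)
print(texts)
print(driver.calls)
```

With a real Selenium driver, `WebDriverWait(driver, 10).until(published_row_loaded)` should accept the same callable, since `until` calls its argument with the driver and returns the first truthy result; the hand-written `wait_until` is only there to keep the sketch self-contained.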
