Scrapy with Selenium does not detect HTML element loaded dynamically
Problem description
I am using Scrapy with Selenium to scrape content from this page: https://nikmikk.itch.io/door-knocker
In it, there is a table under the div with class .game_info_panel_widget, where the first row, Published 62 days ago, seems to be loaded dynamically.
I have tried fetching the page as Scrapy sees it, but I cannot find that row in the HTML.
scrapy fetch --nolog https://nikmikk.itch.io/door-knocker > test.html
Here is what I see in test.html: the first table row is Status, not the Published row that appears when I view the page source directly in Chrome.
<div class="game_info_panel_widget">
  <table>
    <tbody>
      <tr>
        <td>Status</td>
        <td>Prototype</td>
        ...
      </tr>
      ...
In my class SpiderDownloaderMiddleware, I have included Selenium:
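from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1200,600')  # Chrome expects "width,height"
driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated


class SpiderDownloaderMiddleware(object):
    # Omitted other code

    def process_request(self, request, spider):
        driver.get(request.url)
        # Wait until the info panel container is present in the DOM
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".game_info_panel_widget"))
        )
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8-sig', request=request)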
How can I check how that row is loaded, and how can I scrape this information?
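One way to narrow down which rows are added after the initial response is to compare the table labels in the raw HTML (for example the test.html saved by scrapy fetch) with those in the browser-rendered HTML (for example driver.page_source). The following is a minimal sketch using only the standard library; the names TableRowCollector, row_labels and missing_labels are made up for illustration, the two sample strings are hypothetical stand-ins for the real pages, and the parser looks at every td in the document rather than only the info panel.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> it belongs to."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def row_labels(html):
    """Return the stripped text of the first cell of every table row."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return [row[0].strip() for row in parser.rows if row]


def missing_labels(raw_html, rendered_html):
    """Labels present in the rendered page but absent from the raw response."""
    raw = set(row_labels(raw_html))
    return [label for label in row_labels(rendered_html) if label not in raw]


# Hypothetical stand-ins: RAW mimics the scrapy fetch output,
# RENDERED mimics what the browser shows after scripts have run.
RAW = "<table><tr><td>Status</td><td>Prototype</td></tr></table>"
RENDERED = (
    "<table>"
    "<tr><td>Published</td><td>62 days ago</td></tr>"
    "<tr><td>Status</td><td>Prototype</td></tr>"
    "</table>"
)

print(missing_labels(RAW, RENDERED))
```

Any label this reports is a row that was not in the server response, so it has to be waited for in the browser (or obtained from whatever request the page makes) rather than read from the plain Scrapy response.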
Updated:
I followed @Yosuva A's answer below and got something like this:
9 days ago
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English
But the output is inconsistent: sometimes it gives the desired result, sometimes it doesn't. I guess this is because Selenium waits for a generic td element, which is common:
"//div[@class='game_info_panel_widget']//table//tr//td"
I tried changing it to td[@text='Published'], but Selenium times out.
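A likely reason for the timeout is that @text tests an attribute named text, which a td does not have, whereas the visible text is matched with text(), . or normalize-space(). The sketch below illustrates the difference on a small hypothetical snippet using the standard library's ElementTree, which supports only a subset of XPath; PUBLISHED_CELL_XPATH is a suggested full XPath 1.0 expression for Selenium and has not been tested against the live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet shaped like the info panel table.
SAMPLE = """
<div class="game_info_panel_widget">
  <table>
    <tbody>
      <tr><td>Published</td><td>9 days ago</td></tr>
      <tr><td>Status</td><td>In development</td></tr>
    </tbody>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# [@text='Published'] looks for an attribute text="Published": no match.
by_attribute = root.findall(".//td[@text='Published']")

# [.='Published'] compares the element's text content: one match.
by_text = root.findall(".//td[.='Published']")

print(len(by_attribute), len(by_text))  # 0 1

# Assumed equivalent for Selenium, which accepts full XPath 1.0:
PUBLISHED_CELL_XPATH = (
    "//div[@class='game_info_panel_widget']"
    "//td[normalize-space()='Published']"
)
```

With a real driver, PUBLISHED_CELL_XPATH could be passed to EC.presence_of_element_located((By.XPATH, PUBLISHED_CELL_XPATH)) in place of the generic td expression.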
My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Optional argument; if not specified, the PATH is searched.
# (Selenium 3 style; Selenium 4 takes the driver path via a Service object.)
driver = webdriver.Chrome('chromedriver')
driver.implicitly_wait(15)
driver.get("https://thehive.itch.io/promnesia")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
# Wait for a specific element
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")
    )
)
table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")
for rows in table_rows:
    print(rows.text)
driver.quit()
Any other way?
Conclusion:
It works if we add time.sleep(2) after click(), as suggested by Yosuva A.
Solution
Please let me know whether this helps.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Optional argument; if not specified, the PATH is searched.
# (Selenium 3 style; Selenium 4 takes the driver path via a Service object.)
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.implicitly_wait(15)
driver.get("https://thehive.itch.io/promnesia")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
time.sleep(2)  # give the panel time to finish loading after the click
# Wait for a specific element
WebDriverWait(driver, 3).until(
    EC.presence_of_element_located(
        (By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")
    )
)
table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")
for rows in table_rows:
    print(rows.text)
driver.quit()
Output
Updated
1 day ago
Published
9 days ago
Status
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English