Is there any way to tell Selenium to stop executing JS at some point?


Question

I want to crawl a site whose content is partly generated by JS. The site runs a script that updates the content every 5 seconds (it requests a new encrypted JS file, which I can't parse).

My code:

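from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)

driver.get(url)  # url: address of the page to crawl

trs = driver.find_elements_by_css_selector('.table tbody tr')

print(len(trs))

items = []
for tr in trs:
    try:
        items.append(tr.text)
    except StaleElementReferenceException:
        # the page's JS replaced the rows, so this tr is no longer in the DOM
        pass

print(len(items))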

len(items) does not match len(trs). How can I tell Selenium to stop executing JS, or to stop working altogether, after I run trs = driver.find_elements_by_css_selector('.table tbody tr')?

I need to use trs later, so I can't call driver.quit().

Exception details:

---------------------------------------------------------------------------
StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-84-b80e3579efca> in <module>()
     11 items = []
     12 for tr in trs:
---> 13     items.append(tr.text)
     14     #items.append(map_label(hidemyass_label, tr.find_elements_by_tag_name('td')))
     15 

C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in text(self)
     69     def text(self):
     70         """The text of the element."""
---> 71         return self._execute(Command.GET_ELEMENT_TEXT)['value']
     72 
     73     def click(self):

C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in _execute(self, command, params)
    452             params = {}
    453         params['id'] = self._id
--> 454         return self._parent.execute(command, params)
    455 
    456     def find_element(self, by=By.ID, value=None):

C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.pyc in execute(self, driver_command, params)
    199         response = self.command_executor.execute(driver_command, params)
    200         if response:
--> 201             self.error_handler.check_response(response)
    202             response['value'] = self._unwrap_value(
    203                 response.get('value', None))

C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.pyc in check_response(self, response)
    179         elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
    180             raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 181         raise exception_class(message, screen, stacktrace)
    182 
    183     def _value_or_default(self, obj, key, default):

StaleElementReferenceException: Message: {"errorMessage":"Element is no longer attached to the DOM","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63305","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/text","urlParsed":{"anchor":"","query":"","file":"text","directory":"/","path":"/text","relative":"/text","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/text","queryKey":{},"chunks":["text"]},"urlOriginal":"/session/4bb16340-a3b6-11e5-8ce5-9d0be40203a6/element/%3Awdc%3A1450243990539/text"}}
Screenshot: available via screen

Clearly the tr is gone.

PS: I need to use Selenium to select elements. Other libraries such as lxml and pyquery can't tell whether an element is display:none, their .text() often returns comments or the contents of <script> tags, and they have other bugs of that kind. Sadly, Python has no perfect clone of jQuery.

Recommended answer

Use Scrapy. Once you are sure the page has loaded, grab the body using:

from scrapy.http import TextResponse
response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')

You now have a static copy of the page, so you can use Scrapy's response.xpath to pull whatever data you need. This answer has more detail.
