Alternatives to Selenium/Webdriver for filling in fields when scraping headlessly with Python?


Question


With Python 2.7 I'm scraping with urllib2 and when some Xpath is needed, lxml as well. It's fast, and because I rarely have to navigate around the sites, this combination works well. On occasion though, usually when I reach a page that will only display some valuable data when a short form is filled in and a submit button is clicked (example), the scraping-only approach with urllib2 is not sufficient.


Each time such a page is encountered, I could invoke selenium.webdriver to refetch the page and do the form-filling and clicking, but this would slow things down considerably.


NOTE: This question is not about the merits or limitations of urllib2, about which I'm aware there have been many discussions. It's instead focussed only on finding a fast, headless approach to form-filling etc. (one that will also allow for XPath queries if needed).
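For context, the fastest headless route is often to skip the form UI entirely and replicate the POST request the form would send. The sketch below shows that idea in modern Python 3 stdlib (the Python 2.7 equivalent uses urllib/urllib2); the URL and field names ("ticker", "submit") are hypothetical placeholders, not taken from the linked example page.

```python
# Minimal sketch of the "replicate the POST yourself" approach.
# The action URL and field names here are hypothetical -- inspect the
# real form in your browser's dev tools to find the actual names.
from urllib.parse import urlencode
from urllib.request import Request

form_fields = {
    "ticker": "AAPL",   # value you would have typed into the form
    "submit": "Go",     # the submit button's name/value pair, if any
}
body = urlencode(form_fields).encode("ascii")

req = Request(
    "https://example.com/lookup",  # placeholder form action URL
    data=body,                     # presence of data makes this a POST
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
# urlopen(req) would perform the request; it is omitted here because
# the URL is a placeholder. The response body can then be fed to
# lxml.html.fromstring() for XPath queries, as in the question.
```

This keeps everything inside the urllib2/lxml workflow the question already uses, with no browser process at all.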

Answer


There are several things you can consider using:

  • mechanize
  • robobrowser
  • selenium with a headless browser, like PhantomJS, for example, or using a regular browser, but in a Virtual Display
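At their core, mechanize and robobrowser do the same two things: parse the form's fields out of the HTML, then submit them. A rough stdlib-only sketch of that form-parsing step (the HTML snippet and field names here are invented stand-ins for a real page, not any library's actual API):

```python
# Sketch of the form-parsing step that mechanize and robobrowser handle
# for you: collect a form's <input> fields, fill in the ones you want,
# and build the submission body. Pure stdlib; the HTML is a stand-in.
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> tags in a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value", "")

html = """
<form action="/lookup" method="post">
  <input type="hidden" name="session" value="abc123">
  <input type="text" name="query" value="">
  <input type="submit" name="go" value="Search">
</form>
"""
collector = FormFieldCollector()
collector.feed(html)
collector.fields["query"] = "AAPL"   # fill in the text field
body = urlencode(collector.fields)
# body now carries the hidden field, your value, and the submit button,
# ready to POST with urllib. A library like mechanize adds the rest:
# cookie handling, redirects, and picking the right form/action URL.
```

If even this is too manual, that is exactly when the libraries above earn their keep; selenium with a headless browser is the heavyweight fallback for pages that require JavaScript to build the form in the first place.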

