Alternatives to Selenium/Webdriver for filling in fields when scraping headlessly with Python?
Question
With Python 2.7 I'm scraping with urllib2 and, when some XPath is needed, lxml as well. It's fast, and because I rarely have to navigate around the sites, this combination works well. On occasion though, usually when I reach a page that will only display some valuable data when a short form is filled in and a submit button is clicked (example), the scraping-only approach with urllib2 is not sufficient.
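The fetch-then-XPath flow described above can be sketched as follows. To keep the snippet dependency-free, the stdlib's xml.etree.ElementTree stands in for lxml (it supports only a limited subset of XPath), and the HTML fragment is a hypothetical stand-in for a fetched page:

```python
import xml.etree.ElementTree as ET

# Stand-in for a response body from urllib2.urlopen(url).read();
# with lxml.html you would parse the real page instead.
html = """<html><body>
  <table id="results">
    <tr><td class="name">Alpha</td><td class="value">1</td></tr>
    <tr><td class="name">Beta</td><td class="value">2</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(html)
# ElementTree's limited XPath: locate the table, then its value cells.
# lxml would accept a fuller expression in one step, e.g.
#   //table[@id="results"]//td[@class="value"]/text()
table = root.find(".//table[@id='results']")
values = [td.text for td in table.findall(".//td[@class='value']")]
print(values)  # ['1', '2']
```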
Each time such a page is encountered, I could invoke selenium.webdriver to refetch the page and do the form-filling and clicking, but this would slow things down considerably.
NOTE: This question is not about the merits or limitations of urllib2, about which I'm aware there have been many discussions. It's instead focused only on finding a fast, headless approach to form-filling etc. (one that will also allow for XPath queries if needed).
Answer
There are several things you can consider using:
mechanize
robobrowser
selenium with a headless browser, like PhantomJS, for example, or using a regular browser, but in a Virtual Display
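Whichever of these tools is chosen, the fast, headless ones all do the same thing under the hood: read the form's input names (including hidden fields) from the page, fill in the visible ones, and send the result back as an ordinary POST. A dependency-free sketch of that core step using only the stdlib follows; the form markup and field names are hypothetical, and the py3 import names are shown (on 2.7 it would be `from HTMLParser import HTMLParser` and `from urllib import urlencode`):

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical form markup, standing in for a fetched page.
PAGE = """<form action="/search" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="query">
  <input type="submit" name="go" value="Search">
</form>"""

class FormFields(HTMLParser):
    """Collect the name/value pairs of every <input> on the page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                self.fields[attrs["name"]] = attrs.get("value", "")

parser = FormFields()
parser.feed(PAGE)

# Fill in the user-visible field, keep the hidden token, and build
# the POST body a browser would send on submit (sorted for stability).
parser.fields["query"] = "pandas"
body = urlencode(sorted(parser.fields.items()))
print(body)  # go=Search&query=pandas&token=abc123
```

Passing that body to urllib2.urlopen(url, body) submits the form; mechanize and robobrowser automate exactly this bookkeeping, plus cookies and redirects, without the overhead of driving a browser.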