Web scraping a website that blocks access for scripts
Question
There's a website I used to scrape with a Python script (urllib). The site now seems to be blocking my requests: whenever I request a page from the script, I get HTML containing some JS but none of the usual data. Accessing the website from my browser works just fine. I tried changing the 'User-Agent' header to match the one my browser sends, but it didn't help. One strange behavior I observed is that after I open a page in my browser, I can access it from the script too.
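So my questions are:
- How can the server tell that the request is not coming from a browser (even after I changed the User-Agent)?
- What kind of mechanism would cause this strange behavior, where access is allowed only after a browser has loaded the page? Is it caching? If so, where does the caching happen?
- Any ideas on how to proceed? (I have an inelegant workaround in which my browser opens each page before the script loads it, but that takes too much time.)
Thanks!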
Recommended answer
Without many details to go on, it sounds like the site was updated to include a JavaScript loader. urllib can't execute JavaScript, so it can't get past that step. (This is pure speculation.)
There are various ways a site can try to keep a scraper out, including having some JavaScript set or update a cookie, or modify the session in some way, so that only a client that actually ran the script passes this first test. It's completely site-dependent, so you'll have to investigate it by hand.
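As a starting point for that investigation, here is a minimal sketch, assuming the check involves cookies or request headers (which may not be the case for this site). It builds a urllib opener that keeps cookies between requests and sends browser-like headers, plus a rough heuristic for spotting the "JS but no data" page. The header values, the function names and the marker text are all hypothetical; copy the real headers from your browser's network tab.

```python
import http.cookiejar
import urllib.request

# Hypothetical header values: replace them with what your browser really sends.
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def build_opener_with_cookies():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" User-agent header.
    opener.addheaders = list(BROWSER_HEADERS)
    return opener, jar


def looks_like_js_challenge(html, expected_marker):
    """Heuristic: the page has a <script> tag but lacks text the real page always contains."""
    return "<script" in html.lower() and expected_marker not in html


def fetch(opener, url):
    """Download url through the opener and return the body as text."""
    with opener.open(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (not executed here; the URL and marker are placeholders):
#   opener, jar = build_opener_with_cookies()
#   html = fetch(opener, "https://example.com/page")
#   print([cookie.name for cookie in jar])           # cookies the server set
#   print(looks_like_js_challenge(html, "Price"))    # True if data is missing
```

If the jar stays empty and the heuristic still reports a challenge page, the cookie is probably set by JavaScript rather than by a response header, which plain urllib cannot reproduce.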
The usual solution is to use a JavaScript-aware scraper like Selenium, which actually uses a locally installed Firefox, Chrome or IE browser to open the page and simulate clicking on items. You can also use PhantomJS to process the downloaded page.
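A rough sketch of the Selenium route, assuming the selenium package and a local Firefox are installed. The helper names are made up for illustration. The first helper returns the page after the browser has run its scripts; the second turns the browser's cookies into a Cookie header value, which might let a plain urllib request reuse the session the browser established (whether the site accepts that is an assumption to test).

```python
def fetch_rendered_html(driver, url):
    """Load url in a real browser and return the DOM after scripts have run."""
    driver.get(url)
    return driver.page_source


def cookie_header_from_driver(driver):
    """Build a Cookie header value from the browser session's cookies."""
    return "; ".join(f"{c['name']}={c['value']}" for c in driver.get_cookies())


def make_firefox_driver():
    """Start a local Firefox; needs the selenium package and Firefox installed."""
    # Imported lazily so the helpers above carry no hard dependency.
    from selenium import webdriver
    return webdriver.Firefox()


# Usage (not executed here; the URL is a placeholder):
#   driver = make_firefox_driver()
#   try:
#       html = fetch_rendered_html(driver, "https://example.com/page")
#       cookie_header = cookie_header_from_driver(driver)
#   finally:
#       driver.quit()
```

Note that page_source reflects the DOM at the moment it is read, so a page that fetches its data asynchronously may need an explicit wait before the content appears.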
There are plenty of posts on SO about this, but here's one that may give you a starting point: Web-scraping JavaScript page with Python