Web scraping a website that blocks access from scripts

Views: 45 | Published: 2021/9/24 18:58:05 | Tags: python, web-scraping

Question

I used to scrape a website with a Python script (urllib). The site now seems to block my requests: whenever I request a page from the script, I get HTML containing some JavaScript but none of the usual data. Accessing the site from my browser works fine. I tried changing the User-Agent header to match the one my browser sends, but that didn't help. One strange thing I noticed is that after I open a page in my browser, I can access it from the script too.

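So my questions are:

How does the server detect that the client is not a browser (after I changed the User-Agent)?
What kind of mechanism would cause this strange behavior, where access is allowed only after a browser has loaded the page? Is it caching? If so, where does the caching happen?
Any ideas on how to proceed? (I have an inelegant workaround: have my browser open each page before the script loads it, but that takes too much time.)

Thanks!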

Recommended answer

Without many details to go on, it sounds like the site was updated to include a JavaScript loader. urllib can't execute JavaScript, so it never gets past that step. (Pure speculation here.)

There are various ways a site can try to keep a scraper out. For example, it can require some JavaScript to set or update a cookie, or to modify the session in some way, before a client passes this first test. It's completely site-dependent, so you'll have to investigate it by hand.
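As a starting point for that investigation, here is a minimal sketch that stays with urllib but keeps cookies between requests and sends browser-like headers, plus a rough check for "got a script-only page instead of data". The header values, the helper names (make_opener, looks_like_js_challenge, fetch) and the marker idea are my own assumptions for illustration, not something the site is known to require. If the check is implemented purely in JavaScript, this alone will probably not get you through.

```python
import urllib.request
from http.cookiejar import CookieJar

# Assumed example values; copy the real headers from your browser's dev tools.
BROWSER_HEADERS = [
    ("User-Agent",
     "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.5"),
]


def make_opener(jar=None):
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    if jar is None:  # an empty CookieJar is falsy, so compare against None
        jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = list(BROWSER_HEADERS)
    return opener, jar


def looks_like_js_challenge(html, expected_marker):
    """Heuristic only: the page has a <script> tag but lacks text we expect.

    expected_marker is a hypothetical string that appears on a normal page,
    such as a column heading of the data you are scraping.
    """
    lowered = html.lower()
    return "<script" in lowered and expected_marker.lower() not in lowered


def fetch(opener, url):
    """Download url with the given opener and return the decoded body."""
    with opener.open(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Typical use would be to call make_opener() once, fetch a page, and print the cookies in the jar when looks_like_js_challenge returns True. Comparing those cookies and headers with what the browser sends (network tab in the dev tools) usually shows what the script is missing.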

The usual solution is to use a JavaScript-aware scraping tool like Selenium, which drives a locally installed Firefox, Chrome or IE browser to open the page and simulate clicking on items. You can also use PhantomJS to process the downloaded page.
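A rough sketch of the Selenium route, assuming Selenium 4 with Firefox and a matching driver available on the machine. The function names and the wait condition are my own choices. The second helper reflects a guess based on the question (access works after a browser visit): if the gate is cookie-based, the cookies collected by the browser could be reused in plain urllib requests so that not every page has to go through a browser. Whether that works depends entirely on the site.

```python
def cookies_to_header(selenium_cookies):
    """Turn Selenium's list of cookie dicts into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def render_with_browser(url, wait_seconds=10):
    """Open url in Firefox, let its JavaScript run, return (html, cookies)."""
    # Imported here so cookies_to_header works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Placeholder wait: only checks that <body> exists. A real scraper
        # should wait for a site-specific element that holds the data.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.find_elements(By.TAG_NAME, "body")
        )
        return driver.page_source, driver.get_cookies()
    finally:
        driver.quit()  # always close the browser, even on errors
```

To try the cookie idea, call render_with_browser once on an entry page, then send cookies_to_header(cookies) as the Cookie header of later urllib requests and check whether the real data comes back.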

There are plenty of posts on SO about this, but here's one that may give you a starting point: Web-scraping JavaScript page with Python
