Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Problem description

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python 2.6.
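
For context, this is roughly how the error arises: mechanize's Browser honors robots.txt by default, so opening a disallowed path raises the 403 above. A minimal sketch, where the deep-link path is hypothetical:

    import mechanize

    br = mechanize.Browser()  # robots.txt handling is enabled by default
    try:
        # hypothetical deep link of the kind a robots.txt rule might disallow
        br.open("http://www.barnesandnoble.com/w/some-title/1234567890")
    except mechanize.HTTPError as e:
        # surfaces as httperror_seek_wrapper / RobotExclusionError
        print e  # HTTP Error 403: request disallowed by robots.txt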

Hoping there's a way around this.

Recommended answer

You can try lying about your user agent (e.g., by trying to make them believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.
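
For concreteness, a minimal sketch of what that "technical" workaround looks like in mechanize, shown only to make clear what the answer is warning against; the User-Agent string and URL are illustrative:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # stop honoring robots.txt
    br.addheaders = [("User-Agent",  # masquerade as an ordinary browser
                      "Mozilla/5.0 (Windows NT 6.1) Gecko/20110319 Firefox/4.0")]
    response = br.open("http://www.barnesandnoble.com/")

Again, doing this against a site whose robots.txt forbids it is exactly the high-legal-risk route described above.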

一种只会违反 robots.txt 中编码的策略的技术性"解决方法是一种我永远不会推荐的高法律风险方法.顺便说一句,他们的 robots.txt 是如何读取的?

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?
