Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"


Question


Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt


Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth.


I'm using mechanize and BeautifulSoup on Python 2.6.
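For context, the 403 here is raised client-side: mechanize reads the site's robots.txt before fetching and refuses URLs it disallows. The same check can be reproduced with the standard library's `urllib.robotparser`; this is a minimal sketch, and the `Disallow` rule shown is hypothetical, not Barnes & Noble's actual policy:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A generic crawler is allowed at the root but blocked under /private/ --
# the second case is what mechanize turns into "HTTP Error 403".
print(rp.can_fetch("*", "https://example.com/"))           # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```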

Hoping there's a workaround.

Answer


You can try lying about your user agent (e.g., by trying to make them believe you're a human being and not a robot) if you want to risk possible legal trouble with Barnes & Noble. Why not instead get in touch with their business-development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots, such as price-comparison engines, and if you can convince them that you're not one, sign a contract, etc., they may well be willing to make an exception for you.
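Mechanically, overriding the user agent just means sending a custom `User-Agent` header. A minimal sketch with the standard library's `urllib.request` (mechanize's `Browser.addheaders` attribute serves the same purpose); the header string is an arbitrary example, and the legal caveat above still applies:

```python
import urllib.request

# Example browser-style User-Agent string; the exact value is illustrative.
UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Attach the header to the request; no network call is made until urlopen().
req = urllib.request.Request(
    "https://www.barnesandnoble.com/",
    headers={"User-Agent": UA},
)

print(req.get_header("User-agent"))  # the header the server would see
```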


A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?

