Python requests vs. robots.txt


Question

I have a script meant for personal use that scrapes some websites for information and until recently it worked just fine, but it seems one of the websites buffed up its security and I can no longer get access to its contents.

I'm using python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:

'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
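
For reference, a minimal sketch of the kind of call that now returns this page instead of the real content (the URL is a placeholder, since the question does not name the site):

import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"   # placeholder; the question does not name the site

r = requests.get(url)                   # no custom headers, so the default user-agent is sent
soup = BeautifulSoup(r.text, "html.parser")

# Instead of the page content, the body is the Incapsula challenge shown above
print("Incapsula" in r.text)            # True when the request was blocked
print(soup.find("iframe"))              # the _Incapsula_Resource iframe, not the real page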

I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html
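
As a rough check (my own sketch, not part of the original question), the robots meta tag from that page is easy to confirm in the blocked response with BeautifulSoup:

import re
from bs4 import BeautifulSoup

# The blocked response from above, truncated to the relevant part
blocked_html = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'

soup = BeautifulSoup(blocked_html, "html.parser")
meta = soup.find("meta", attrs={"name": re.compile("^robots$", re.I)})
if meta is not None:
    print(meta.get("content"))  # prints: NOINDEX, NOFOLLOW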

Is there any way I can convince the website that I'm not a malicious robot? This is a script I run ~1 time per day on a single bit of source, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)

Tried switching to mechanize and ignoring robots.txt that way, but I'm not getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to Plan B, by no longer using the website unless anyone has any other ideas.
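
For context, ignoring robots.txt with mechanize typically looks roughly like this (a sketch of the approach described, not the question's actual code; the URL and user-agent string are placeholders):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # do not fetch or honour robots.txt
br.addheaders = [("User-Agent",
                  "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1")]

response = br.open("https://example.com/some-page")  # placeholder URL
html = response.read()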

Answer

What is most likely happening is the Server is checking the user-agent and denying access to the default user-agent used by bots.

For example, requests sets the user-agent to python-requests/2.9.1.
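
You can check the default for your installed version with requests.utils.default_user_agent() (the version number will vary):

import requests

# The user-agent string requests sends when you do not override it
print(requests.utils.default_user_agent())   # e.g. "python-requests/2.9.1"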

You can specify the headers yourself:

url = "https://google.com"
UAS = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1", 
       "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
       "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

ua = UAS[random.randrange(len(UAS))]

headers = {'user-agent': ua}
r = requests.get(url, headers=headers)
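
A side note on the snippet above: random.choice(UAS) does the same job as indexing with random.randrange, and once a browser-like user-agent is in place you can hand r.text to BeautifulSoup exactly as before. Whether this alone gets past Incapsula depends on how aggressively the site fingerprints clients, so it is worth checking that the response is not still the challenge page before parsing it.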
