I need to write a web crawler for a specific user agent

Question

I need to write a web crawler and want it to crawl using a known user agent. For example, I want my crawler to act as an iPhone to crawl the mobile version of a website, then crawl again using a Mozilla PC agent, and so on.

That way, I'll be able to crawl every "type" of site (mobile and PC). However, I also want to set my crawler's own user agent, so that webmasters see in their stats that a crawler visited their whole website, not real users.

So my question is: does anyone know how to set a mobile agent and a crawler agent at the same time in PHP? Is it even possible?

Recommended answer

See RFC 1945 for how a User-Agent value is formed:

10.15 User-Agent

The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. Although it is not required, user agents should include this field with requests. The field can contain multiple product tokens (Section 3.7) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.

 User-Agent     = "User-Agent" ":" 1*( product | comment )

Example:

  User-Agent: CERN-LineMode/2.15 libwww/2.17b3
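
The grammar above can be sketched in PHP as a small helper that joins product tokens and parenthesized comments with single spaces. The function name buildUserAgent and the shape of its input array are assumptions made for illustration; they are not part of RFC 1945 or of PHP itself.

```php
<?php
// Hypothetical helper: builds a User-Agent value following
//   User-Agent = "User-Agent" ":" 1*( product | comment )
// Each part is either ['product' => name, 'version' => optional version]
// or ['comment' => text].
function buildUserAgent(array $parts): string
{
    if ($parts === []) {
        // The grammar requires at least one product or comment.
        throw new InvalidArgumentException('At least one part is required.');
    }

    $out = [];
    foreach ($parts as $part) {
        if (isset($part['comment'])) {
            // Comments are wrapped in parentheses.
            $out[] = '(' . $part['comment'] . ')';
        } else {
            // A product is a token with an optional "/version" suffix.
            $out[] = isset($part['version'])
                ? $part['product'] . '/' . $part['version']
                : $part['product'];
        }
    }

    return implode(' ', $out);
}

echo buildUserAgent([
    ['product' => 'CERN-LineMode', 'version' => '2.15'],
    ['product' => 'libwww', 'version' => '2.17b3'],
]), "\n";
// prints: CERN-LineMode/2.15 libwww/2.17b3
```

The helper does not validate token characters, so it should only be fed values you control.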

So what you put there is more or less up to you. You could pose as GoogleBot-Mobile, or pose as an iPhone and add your own token:

  Mozilla/5.0 (iPhone; U; CPU iPhone OS) (compatible; MyBot/1.0; +http://about.my/bot)
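
One way to apply this in PHP is to request the same URL once per device profile through cURL's CURLOPT_USERAGENT option, appending the same bot comment to each profile. This is a minimal sketch, assuming the cURL extension is installed. MyBot/1.0 and the about.my URL come from the example above; the desktop string and the function names crawlerUserAgents and fetchAs are placeholders chosen for illustration.

```php
<?php
// Bot identifier appended to every profile so that webmasters can
// recognise the crawler in their logs (placeholder name and URL).
const BOT_COMMENT = '(compatible; MyBot/1.0; +http://about.my/bot)';

// Illustrative device profiles; the desktop string is an assumption.
function crawlerUserAgents(): array
{
    return [
        'mobile'  => 'Mozilla/5.0 (iPhone; U; CPU iPhone OS) ' . BOT_COMMENT,
        'desktop' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' . BOT_COMMENT,
    ];
}

// Fetches $url while presenting $userAgent.
// Returns the response body as a string, or false on failure.
function fetchAs(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);   // value of the User-Agent header
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects, e.g. to a mobile host
    return curl_exec($ch);
}

// Usage (performs real HTTP requests, so it is left commented out):
// foreach (crawlerUserAgents() as $profile => $ua) {
//     $html = fetchAs('http://example.com/', $ua);
//     echo $profile, ': ', $html === false ? 'failed' : strlen($html) . ' bytes', "\n";
// }
```

Whether a site actually serves its mobile version for the first profile depends on how that site inspects the header, so the result should be checked for each target.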
