How can I use a random user agent every time I send a request?


Problem description

I know how to use a random (fake) user agent in Scrapy, but after I run Scrapy I only see one user agent in the terminal. So I guessed that maybe settings.py runs only once when Scrapy starts. If Scrapy really works like this and sends 1000 requests to some web page to collect 1000 items, it will send the same user agent every time. I think that makes it easy to get banned.

Can you tell me how to send a random user agent each time Scrapy sends a request to a website?

I used this library in my Scrapy project, setting Faker as the user agent in settings.py:

https://pypi.org/project/Faker/

# settings.py
from faker import Faker

fake = Faker()
Faker.seed(fake.random_number())
fake_user_agent = fake.chrome()  # generated once, when settings.py is loaded

USER_AGENT = fake_user_agent

This is what I wrote in settings.py. Will it work?

Answer

If you set USER_AGENT in your settings.py as in your question, you will just get a single (random) user agent for your entire crawl: the settings module is evaluated once at startup, so fake.chrome() is called exactly once.
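A quick way to see this for yourself is to print each request's user agent as the crawl proceeds. This is a minimal sketch; the quotes.toscrape.com URL is just the test site used later in this answer:

# check_ua.py - minimal sketch: USER_AGENT stays fixed for the whole crawl
from faker import Faker
from scrapy import Spider

fake = Faker()


class CheckUaSpider(Spider):
    name = "check_ua"
    start_urls = ["http://quotes.toscrape.com/"]
    # Evaluated once when the class is defined, just like settings.py
    custom_settings = {"USER_AGENT": fake.chrome()}

    def parse(self, response):
        # Prints the same user agent for every page crawled
        print(response.request.headers.get("User-Agent"))
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Every line printed will be identical, which is exactly the behaviour described in the question.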

You have a few options if you want to set a fake user agent for each request.

Option 1: Set the user-agent in the headers of each Request

This approach sets the user agent in the headers of each Request directly. In your spider code you can import Faker as you do above, but then call e.g. fake.chrome() on every Request. For example:

# At the top of your file
from faker import Faker
from scrapy import Request

# This can be a global or class variable
fake = Faker()

...

# When you make a Request
yield Request(url, headers={"User-Agent": fake.chrome()})

Option 2: Write a middleware to do this automatically

I won't go into this in detail, because you might as well use one that already exists.
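If you do want to roll your own, a minimal sketch of such a downloader middleware could look like the following (the module path and class name here are hypothetical; option 3's middleware does this and more for you):

# myproject/middlewares.py - hypothetical hand-rolled middleware
from faker import Faker


class RandomFakeUserAgentMiddleware:
    def __init__(self):
        self.fake = Faker()

    def process_request(self, request, spider):
        # Overwrite the user agent on every outgoing request
        request.headers["User-Agent"] = self.fake.chrome()

You would then enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomFakeUserAgentMiddleware": 400,
}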

Option 3: Use scrapy-fake-useragent

If you have lots of requests in your code, option 1 isn't so nice, so you can use a middleware to do this for you. Once you've installed scrapy-fake-useragent, you can set it up in your settings file as described on its PyPI page:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',
    'scrapy_fake_useragent.providers.FakerProvider',  
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]

Using this you'll get a new user agent per Request, and if a Request fails you'll also get a new random user agent. One of the key parts of this setup is FAKEUSERAGENT_PROVIDERS, which tells the middleware where to get the user agent from. The providers are tried in the order they are defined, so the second is tried if the first one fails for some reason (if getting the user agent fails, not if the Request fails). Note that if you want to use Faker as the primary provider, you should put it first in the list:

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakerProvider',
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',     
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',
]

There are other configuration options (such as using a random Chrome-like user agent) listed in the scrapy-fake-useragent docs.
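For example, the project README documents per-provider settings along these lines (the option names are taken from the README; check the docs for your installed version):

# settings.py - assumed option names from the scrapy-fake-useragent README
FAKE_USERAGENT_RANDOM_UA_TYPE = "chrome"  # used by FakeUserAgentProvider
FAKER_RANDOM_UA_TYPE = "chrome"           # used by FakerProvider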

Here is an example spider. For convenience I've set the settings inside the spider, but you can put them into your settings.py file instead.

# fake_user_agents.py
from scrapy import Spider


class FakesSpider(Spider):
    name = "fakes"
    start_urls = ["http://quotes.toscrape.com/"]
    custom_settings = dict(
        DOWNLOADER_MIDDLEWARES={
            "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
            "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
            "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
        },
        FAKEUSERAGENT_PROVIDERS=[
            "scrapy_fake_useragent.providers.FakerProvider",
            "scrapy_fake_useragent.providers.FakeUserAgentProvider",
            "scrapy_fake_useragent.providers.FixedUserAgentProvider",
        ],
    )

    def parse(self, response):
        # Print out the user-agent of the request to check they are random
        print(response.request.headers.get("User-Agent"))

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Then if I run this with scrapy runspider fake_user_agents.py --nolog, the output is:

b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 11_0) AppleWebKit/533.1 (KHTML, like Gecko) Chrome/59.0.811.0 Safari/533.1'
b'Opera/8.18.(Windows NT 6.2; tt-RU) Presto/2.9.169 Version/11.00'
b'Opera/8.40.(X11; Linux i686; ka-GE) Presto/2.9.176 Version/11.00'
b'Opera/9.42.(X11; Linux x86_64; sw-KE) Presto/2.9.180 Version/12.00'
b'Mozilla/5.0 (Macintosh; PPC Mac OS X 10 5_1 rv:6.0; cy-GB) AppleWebKit/533.45.2 (KHTML, like Gecko) Version/5.0.3 Safari/533.45.2'
b'Opera/8.17.(X11; Linux x86_64; crh-UA) Presto/2.9.161 Version/11.00'
b'Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.1; Trident/3.1)'
b'Mozilla/5.0 (Android 3.1; Mobile; rv:55.0) Gecko/55.0 Firefox/55.0'
b'Mozilla/5.0 (compatible; MSIE 9.0; Windows CE; Trident/5.0)'
b'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 11_9; rv:1.9.4.20) Gecko/2019-07-26 10:00:35 Firefox/9.0'
