Scrapy - access data while crawling and randomly change user agent


Question


Is it possible to access the data while Scrapy is crawling? I have a script that finds a specific keyword and writes the keyword to a .csv file, along with the link where it was found. However, I have to wait for Scrapy to finish crawling; only when it is done does it actually output the data to the .csv file.


I'm also trying to change my user agent randomly, but it's not working. If two questions in one aren't allowed, I will post this as a separate question.

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
from scrapy.spiders import Spider
from scrapy import log
from FinalSpider.items import Page
from FinalSpider.settings import USER_AGENT_LIST

import random
import telnetlib
import time
 
 
class FinalSpider(Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    start_urls = ['url.com=%d' %(n)
              for n in xrange(62L, 62L)]


    def parse(self, response):
        item = Page()

        item['URL'] = response.url
        item['Stake'] = ''.join(response.xpath('//div[@class="class"]//span[@class="class" or @class="class"]/text()').extract())
        # note: the original code checked item['cur'], a field that is never
        # set above; item['Stake'] is the field actually populated
        if item['Stake'] in [u'50,00', u'100,00']:
            return item

# 30% useragent change
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1,100)) <= 30:
            log.msg('Changing UserAgent')
            ua  = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')

Answer


You are not obliged to output your collected items (aka "data") into a csv file; you could run Scrapy with just:

scrapy crawl myspider


This will output the logs to the terminal, but for storing just the items into a csv file I assume you are doing something like this:

scrapy crawl myspider -o items.csv
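If you need to access items as they are scraped rather than waiting for the crawl to finish, an item pipeline can also write each row to disk immediately. A minimal sketch (the class name, CSV path, and field names are assumptions based on the question's code, not part of the original answer):

```python
# pipelines.py -- hypothetical sketch: append each item to a CSV file
# as soon as it is scraped, instead of waiting for the crawl to end
import csv


class LiveCsvPipeline(object):
    def open_spider(self, spider):
        # line-buffered text file so rows reach disk promptly
        self.file = open('live_items.csv', 'w', buffering=1, newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['URL', 'Stake'])  # header row

    def process_item(self, item, spider):
        self.writer.writerow([item.get('URL', ''), item.get('Stake', '')])
        self.file.flush()  # make the row visible to readers right away
        return item

    def close_spider(self, spider):
        self.file.close()
```

It would then be enabled in settings.py with something like `ITEM_PIPELINES = {'FinalSpider.pipelines.LiveCsvPipeline': 300}` (the module path is an assumption based on the project name in the imports), and the growing CSV can be watched while the spider runs.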


Now if you want to store both the logs and the items, I suggest you put this into your settings.py file:

LOG_FILE = "logfile.log"


Now you can see what is happening while the spider runs just by checking that file.


For your problem with the random user agent, please check how to activate Scrapy downloader middlewares.
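As a sketch of that activation, the middleware has to be registered in settings.py; the module path `FinalSpider.middlewares` below is an assumption (adjust it to wherever `RandomUserAgentMiddleware` actually lives), and the user-agent strings are placeholders:

```python
# settings.py -- hypothetical sketch of enabling the custom middleware.
# Disabling the built-in user-agent middleware matters because it also
# calls headers.setdefault('User-Agent', ...), so whichever runs first
# wins; on older Scrapy versions the built-in path is
# 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware'.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'FinalSpider.middlewares.RandomUserAgentMiddleware': 400,
}

# placeholder agents; replace with a real list
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]
```

Note also that `headers.setdefault` in the question's middleware only sets the header when it is not already present, so if another middleware has set a User-Agent first, the random one is silently ignored; assigning `request.headers['User-Agent'] = ua` is the unconditional alternative.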
