Using Middleware to ignore duplicates in Scrapy


Question

I'm a beginner in Python, and I'm using Scrapy for a personal web project.

I use Scrapy to extract data from several websites repeatedly, so I need to check on every crawl whether a link is already in the database before adding it. I did this in a pipelines.py class:

from scrapy.exceptions import DropItem

class DuplicatesPipline(object):
    def process_item(self, item, spider):
        # memc2 is the memcache client holding the links already stored in the database
        if memc2.get(item['link']) is None:
            return item
        else:
            raise DropItem('Duplicate link: %s' % item['link'])
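
For the pipeline to take effect at all, it also has to be enabled in settings.py. A minimal sketch, assuming the class lives in myproject/pipelines.py (the module path is an assumption; in older Scrapy versions ITEM_PIPELINES is a plain list of class paths rather than a dict):

ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipline': 300,  # hypothetical module path
}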

But I heard that using Middleware is better for this task.

I found it a little hard to use Middleware in Scrapy; can anyone please point me to a good tutorial?

Suggestions are welcome.

Thanks,

I'm using MySQL and memcache.

Here is my attempt based on @Talvalin's answer:

# -*- coding: utf-8 -*-

from scrapy.exceptions import IgnoreRequest
import MySQLdb as mdb
import memcache

connexion = mdb.connect('localhost','dev','passe','mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates():

    def __init__(self):
        #clear memcache object
        memc2.flush_all()

        #update memc2
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for item in cur.fetchall():
                memc2.set(item[0], item[1])

    def precess_request(self, request, spider):
        #if the url is not in memc2 keys, it returns None.
        if memc2.get(request.url) is None:
            return None
        else:
            raise IgnoreRequest()

And the DOWNLOADER_MIDDLEWARES setting in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}

But it seems that the process_request method is ignored when crawling.

Thanks in advance,

Answer

Here's some example middleware code that loads URLs from a sqlite3 table (Id INT, url TEXT) into a set, and then checks request URLs against that set to determine whether the URL should be ignored. It should be reasonably straightforward to adapt this code to use MySQL and memcache, but please let me know if you have any issues or questions. :)

import sqlite3
from scrapy.exceptions import IgnoreRequest

class IgnoreDuplicates():

    def __init__(self):
        self.crawled_urls = set()

        # Load every previously crawled url into an in-memory set
        with sqlite3.connect(r'C:\dev\scrapy.db') as conn:
            cur = conn.cursor()
            cur.execute("""SELECT url FROM CrawledURLs""")
            self.crawled_urls.update(x[0] for x in cur.fetchall())

        print self.crawled_urls

    def process_request(self, request, spider):
        # Raising IgnoreRequest drops the request; returning None lets it
        # continue through the rest of the middleware chain
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        else:
            return None

On the off-chance you have import issues like me and are about to punch your monitor, the code above was in a middlewares.py file placed in the top-level project folder, with the following DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreDuplicates': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 500,
}
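
For completeness, adapting the sqlite3 example above to the MySQL + memcache setup from the question might look roughly like the sketch below. The connection details and the items table come from the question itself; treat this as an untested outline rather than a drop-in implementation.

import MySQLdb as mdb
import memcache
from scrapy.exceptions import IgnoreRequest

# Connection details taken from the question; adjust as needed.
connexion = mdb.connect('localhost', 'dev', 'passe', 'mydb')
memc2 = memcache.Client(['127.0.0.1:11211'], debug=1)

class IgnoreDuplicates(object):

    def __init__(self):
        # Rebuild the cache from the database when the crawl starts
        memc2.flush_all()
        with connexion:
            cur = connexion.cursor()
            cur.execute('SELECT link, title FROM items')
            for link, title in cur.fetchall():
                memc2.set(link, title)

    # Scrapy only calls a downloader middleware method named exactly
    # process_request; a misspelled method name is silently ignored.
    def process_request(self, request, spider):
        if memc2.get(request.url) is None:
            return None
        raise IgnoreRequest()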
