Scrapy管道SQLAlchemy在输入DB之前检查项目是否存在? [英] Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?

查看:112
本文介绍了Scrapy管道SQLAlchemy在输入DB之前检查项目是否存在?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在编写一个抓抓的蜘蛛来抓取youtube vid并捕获,名称,子用户数,链接等.我从教程中复制了此SQLalchemy代码并使其正常工作,但是每次我运行抓取器时,我都会在数据库.

Im writing a scrapy spider to crawl youtube vids and capture, name, subsrciber count, link, etc. I copied this SQLalchemy code from a tutorial and got it working, but every time i run the crawler i get duplicated info in the DB.

如何检查刮取的数据是否已存在于数据库中,如果已存在,请不要进入数据库....

How do i check if the scraped data is already in the DB and if so, dont enter into the DB....

这是我的pipeline.py代码

Here is my pipeline.py code

from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class YtscraperPipeline(object):

    """YTscraper  pipeline for storing scraped items in the database"""
    def __init__(self):

                #Initializes database connection and sessionmaker.
                #Creates deals table.
            engine = db_connect()
            create_channel_table(engine)
            self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
                """Save youtube channel in the database.

                This method is called for every item pipeline component.

                """
                session = self.Session()
                channel = Channels(**item)

                try:
                        session.add(channel)
                        session.commit()
                except:
                        session.rollback()
                        raise
                finally:
                        session.close()

                return item

这是我的模特.py

from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings


DeclarativeBase = declarative_base()


def db_connect():
    """
    Performs database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


def create_channel_table(engine):
    """"""
    DeclarativeBase.metadata.create_all(engine)


class Channels(DeclarativeBase):
    """Sqlalchemy deals model"""
    __tablename__ = "ytchannels"

    id = Column(Integer, primary_key=True)
    ctitle = Column('title', String)
    clink = Column('link', String, nullable=True)
    csubs = Column('subs', String, nullable=True)

    date = Column('date', DateTime, nullable=True)

我不想重复添加到数据库中.我该怎么办?

I would like to not have duplicates added to the DB. How can i do that?

这是我每次运行转储表时得到的,基本上是一遍又一遍地添加相同的信息.

this is what i get when i dump the table each run, basically adds the same info over and over.

 id |        title         |                           link                           |  subs   |            date            
----+----------------------+----------------------------------------------------------+---------+----------------------------
  1 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:09:48.236281
  2 | DataDash             | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,691 | 2019-02-02 15:09:49.517085
  3 | Tone Vays            | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,588  | 2019-02-02 15:09:52.502221
  4 | Crypt0               | https://www.youtube.com/user/obham001                    | 119,046 | 2019-02-02 15:09:52.895278
  5 | The Modern Investor  | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,228 | 2019-02-02 15:09:52.990033
  6 | Decentralized TV     | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211  | 2019-02-02 15:09:53.108132
  7 | Crypto Daily         | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,341 | 2019-02-02 15:09:53.138157
  8 | RoadtoRoota          | https://www.youtube.com/user/RoadtoRoota                 | 54,954  | 2019-02-02 15:09:54.386956
  9 | Altcoin Buzz         | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,547 | 2019-02-02 15:09:54.412399
 10 | TheChartGuys         | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,431 | 2019-02-02 15:09:55.36888
 11 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:09:55.563061
 12 | Altcoin Daily        | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,543  | 2019-02-02 15:09:56.327525
 13 | The Moon             | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,291  | 2019-02-02 15:09:56.376596
 14 | Alessio Rastani      | https://www.youtube.com/user/alessiorastani              | 176,025 | 2019-02-02 15:09:56.439162
 15 | CryptosRUs           | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,387  | 2019-02-02 15:09:56.482699
 16 | Crypto Zombie        | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,715  | 2019-02-02 15:09:56.582438
 17 | Crypto Love          | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,999  | 2019-02-02 15:09:56.792019
 18 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333  | 2019-02-02 15:09:58.092356
 19 | sunny decree         | https://www.youtube.com/user/d3cr33                      | 80,294  | 2019-02-02 15:09:58.127674
 20 | Crypto Jebb          | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531  | 2019-02-02 15:09:58.396679
 21 | Chico Crypto         | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,144  | 2019-02-02 15:09:58.467988
 22 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:44:46.905164
 23 | DataDash             | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,688 | 2019-02-02 15:44:49.13279
 24 | Crypto Daily         | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,342 | 2019-02-02 15:44:50.450665
 25 | The Modern Investor  | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,226 | 2019-02-02 15:44:50.513322
 26 | Tone Vays            | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,589  | 2019-02-02 15:44:50.546499
 27 | Crypt0               | https://www.youtube.com/user/obham001                    | 119,040 | 2019-02-02 15:44:50.642958
 28 | Ivan on Tech         | https://www.youtube.com/user/LiljeqvistIvan              | 195,249 | 2019-02-02 15:44:50.951154
 29 | Decentralized TV     | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211  | 2019-02-02 15:44:51.191991
 30 | Altcoin Buzz         | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,546 | 2019-02-02 15:44:51.266842
 31 | Alessio Rastani      | https://www.youtube.com/user/alessiorastani              | 176,027 | 2019-02-02 15:44:51.420558
 32 | The Moon             | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,294  | 2019-02-02 15:44:52.020989
 33 | RoadtoRoota          | https://www.youtube.com/user/RoadtoRoota                 | 54,954  | 2019-02-02 15:44:52.177793
 34 | TheChartGuys         | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,437 | 2019-02-02 15:44:52.245701
 35 | Altcoin Daily        | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,538  | 2019-02-02 15:44:52.864349
 36 | Crypto Zombie        | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,716  | 2019-02-02 15:44:53.042814
 37 | CryptosRUs           | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,388  | 2019-02-02 15:44:53.246394
 38 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333  | 2019-02-02 15:44:53.54117
 39 | sunny decree         | https://www.youtube.com/user/d3cr33                      | 80,294  | 2019-02-02 15:44:54.288063
 40 | Crypto Love          | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,998  | 2019-02-02 15:44:54.591665
 41 | Crypto Jebb          | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531  | 2019-02-02 15:44:54.769744
 42 | Chico Crypto         | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,148  | 2019-02-02 15:44:55.791358

推荐答案

如果我正确理解您的话,则需要使用一些唯一的标识符来检查您的数据库中是否存在抓取的结果. 例如,您可以使用标题"列. 使用这种方法,您可以像这样修改process_item方法:

If I understood you correctly - you need to use some unique identifier in order to check if scraped results are present in your DB. For example, you can use 'title' column. Using this approach you can amend your process_item method like this:

class YtscraperPipeline(object):

    def __init__(self):
        #Initializes database connection and sessionmaker.
        engine = db_connect()
        create_channel_table(engine)
        Session = sessionmaker(bind=engine)
        self.session = Session()

    def process_item(self, item, spider):
        # check if item with this title exists in DB
        item_exists = self.session.query(Channels).filter_by(title=item['title']).first()
        # if item exists in DB - we just update 'date' and 'subs' columns.
        if item_exists:
            item_exists.date = item['date']
            item_exists.subs = item['subs'] 
            print('Item {} updated.'.format(item['title']))
        # if not - we insert new item to DB
        else:     
            new_item = Channels(**item)
            self.session.add(new_item)
            print('New item {} added to DB.'.format(item['title']))
        return item    

    def close_spider(self, spider):
        # We commit and save all items to DB when spider finished scraping.
        try:
            self.session.commit()
        except:
            self.session.rollback()
            raise
        finally:
            self.session.close()

这篇关于Scrapy管道SQLAlchemy在输入DB之前检查项目是否存在?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆