Scrapy管道SQLAlchemy在输入DB之前检查项目是否存在? [英] Scrapy pipeline SQLAlchemy Check if item exists before entering to DB?
问题描述
我正在编写一个抓抓的蜘蛛来抓取youtube vid并捕获,名称,子用户数,链接等.我从教程中复制了此SQLalchemy代码并使其正常工作,但是每次我运行抓取器时,我都会在数据库.
Im writing a scrapy spider to crawl youtube vids and capture, name, subsrciber count, link, etc. I copied this SQLalchemy code from a tutorial and got it working, but every time i run the crawler i get duplicated info in the DB.
如何检查刮取的数据是否已存在于数据库中,如果已存在,请不要进入数据库....
How do i check if the scraped data is already in the DB and if so, dont enter into the DB....
这是我的pipeline.py代码
Here is my pipeline.py code
from sqlalchemy.orm import sessionmaker
from models import Channels, db_connect, create_channel_table
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class YtscraperPipeline(object):
"""YTscraper pipeline for storing scraped items in the database"""
def __init__(self):
#Initializes database connection and sessionmaker.
#Creates deals table.
engine = db_connect()
create_channel_table(engine)
self.Session = sessionmaker(bind=engine)
def process_item(self, item, spider):
"""Save youtube channel in the database.
This method is called for every item pipeline component.
"""
session = self.Session()
channel = Channels(**item)
try:
session.add(channel)
session.commit()
except:
session.rollback()
raise
finally:
session.close()
return item
这是我的模特.py
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings
DeclarativeBase = declarative_base()
def db_connect():
"""
Performs database connection using database settings from settings.py.
Returns sqlalchemy engine instance
"""
return create_engine(URL(**settings.DATABASE))
def create_channel_table(engine):
""""""
DeclarativeBase.metadata.create_all(engine)
class Channels(DeclarativeBase):
"""Sqlalchemy deals model"""
__tablename__ = "ytchannels"
id = Column(Integer, primary_key=True)
ctitle = Column('title', String)
clink = Column('link', String, nullable=True)
csubs = Column('subs', String, nullable=True)
date = Column('date', DateTime, nullable=True)
我不想重复添加到数据库中.我该怎么办?
I would like to not have duplicates added to the DB. How can i do that?
这是我每次运行转储表时得到的,基本上是一遍又一遍地添加相同的信息.
this is what i get when i dump the table each run, basically adds the same info over and over.
id | title | link | subs | date
----+----------------------+----------------------------------------------------------+---------+----------------------------
1 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:48.236281
2 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,691 | 2019-02-02 15:09:49.517085
3 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,588 | 2019-02-02 15:09:52.502221
4 | Crypt0 | https://www.youtube.com/user/obham001 | 119,046 | 2019-02-02 15:09:52.895278
5 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,228 | 2019-02-02 15:09:52.990033
6 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:09:53.108132
7 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,341 | 2019-02-02 15:09:53.138157
8 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:09:54.386956
9 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,547 | 2019-02-02 15:09:54.412399
10 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,431 | 2019-02-02 15:09:55.36888
11 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:09:55.563061
12 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,543 | 2019-02-02 15:09:56.327525
13 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,291 | 2019-02-02 15:09:56.376596
14 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,025 | 2019-02-02 15:09:56.439162
15 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,387 | 2019-02-02 15:09:56.482699
16 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,715 | 2019-02-02 15:09:56.582438
17 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,999 | 2019-02-02 15:09:56.792019
18 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:09:58.092356
19 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:09:58.127674
20 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:09:58.396679
21 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,144 | 2019-02-02 15:09:58.467988
22 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:46.905164
23 | DataDash | https://www.youtube.com/channel/UCCatR7nWbYrkVXdxXb4cGXw | 315,688 | 2019-02-02 15:44:49.13279
24 | Crypto Daily | https://www.youtube.com/channel/UC67AEEecqFEc92nVvcqKdhA | 121,342 | 2019-02-02 15:44:50.450665
25 | The Modern Investor | https://www.youtube.com/channel/UC-5HLi3buMzdxjdTdic3Aig | 122,226 | 2019-02-02 15:44:50.513322
26 | Tone Vays | https://www.youtube.com/channel/UCbiWJYRg8luWHnmNkJRZEnw | 82,589 | 2019-02-02 15:44:50.546499
27 | Crypt0 | https://www.youtube.com/user/obham001 | 119,040 | 2019-02-02 15:44:50.642958
28 | Ivan on Tech | https://www.youtube.com/user/LiljeqvistIvan | 195,249 | 2019-02-02 15:44:50.951154
29 | Decentralized TV | https://www.youtube.com/channel/UCueLJ4vLHTwMpYILmdBjRlg | 79,211 | 2019-02-02 15:44:51.191991
30 | Altcoin Buzz | https://www.youtube.com/channel/UCGyqEtcGQQtXyUwvcy7Gmyg | 210,546 | 2019-02-02 15:44:51.266842
31 | Alessio Rastani | https://www.youtube.com/user/alessiorastani | 176,027 | 2019-02-02 15:44:51.420558
32 | The Moon | https://www.youtube.com/channel/UCc4Rz_T9Sb1w5rqqo9pL1Og | 37,294 | 2019-02-02 15:44:52.020989
33 | RoadtoRoota | https://www.youtube.com/user/RoadtoRoota | 54,954 | 2019-02-02 15:44:52.177793
34 | TheChartGuys | https://www.youtube.com/channel/UCnqZ2hx679DqRi6khRUNw2g | 113,437 | 2019-02-02 15:44:52.245701
35 | Altcoin Daily | https://www.youtube.com/channel/UCbLhGKVY-bJPcawebgtNfbw | 62,538 | 2019-02-02 15:44:52.864349
36 | Crypto Zombie | https://www.youtube.com/channel/UCiUnrCUGCJTCC7KjuW493Ww | 46,716 | 2019-02-02 15:44:53.042814
37 | CryptosRUs | https://www.youtube.com/channel/UCI7M65p3A-D3P4v5qW8POxQ | 51,388 | 2019-02-02 15:44:53.246394
38 | Crypto Kirby Trading | https://www.youtube.com/channel/UCOaew10hdmtfa0MinTjOBqg | 31,333 | 2019-02-02 15:44:53.54117
39 | sunny decree | https://www.youtube.com/user/d3cr33 | 80,294 | 2019-02-02 15:44:54.288063
40 | Crypto Love | https://www.youtube.com/channel/UCu7Sre5A1NMV8J3s2FhluCw | 93,998 | 2019-02-02 15:44:54.591665
41 | Crypto Jebb | https://www.youtube.com/channel/UCviqt5aaucA1jP3qFmorZLQ | 17,531 | 2019-02-02 15:44:54.769744
42 | Chico Crypto | https://www.youtube.com/channel/UCHop-jpf-huVT1IYw79ymPw | 29,148 | 2019-02-02 15:44:55.791358
推荐答案
如果我正确理解您的话,则需要使用一些唯一的标识符来检查您的数据库中是否存在抓取的结果.
例如,您可以使用标题"列.
使用这种方法,您可以像这样修改process_item
方法:
If I understood you correctly - you need to use some unique identifier in order to check if scraped results are present in your DB.
For example, you can use 'title' column.
Using this approach you can amend your process_item
method like this:
class YtscraperPipeline(object):
def __init__(self):
#Initializes database connection and sessionmaker.
engine = db_connect()
create_channel_table(engine)
Session = sessionmaker(bind=engine)
self.session = Session()
def process_item(self, item, spider):
# check if item with this title exists in DB
item_exists = self.session.query(Channels).filter_by(title=item['title']).first()
# if item exists in DB - we just update 'date' and 'subs' columns.
if item_exists:
item_exists.date = item['date']
item_exists.subs = item['subs']
print('Item {} updated.'.format(item['title']))
# if not - we insert new item to DB
else:
new_item = Channels(**item)
self.session.add(new_item)
print('New item {} added to DB.'.format(item['title']))
return item
def close_spider(self, spider):
# We commit and save all items to DB when spider finished scraping.
try:
self.session.commit()
except:
self.session.rollback()
raise
finally:
self.session.close()
这篇关于Scrapy管道SQLAlchemy在输入DB之前检查项目是否存在?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!