How to prevent duplicates on Scrapy fetching depending on an existing JSON list

Question

With this spider:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):

        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()

        if response.css('div.score.likes::attr(title)').extract_first() is not None:
            topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
        else:
            topics["score"] = "0"

        if int(topics["score"]) > 10000:
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')

        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()

        yield users
        yield topics

I get these results:

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  ....
]

However, I run this spider every day to get the latest results for the week, so if today is, for example, the 7th day of the week, today's output duplicates the previous 6 days, like this:

day1: result_day1
day2: result_day2, result_day1
day3: result_day3, result_day2, result_day1
. . . . . . .
day7: result_day7, result_day6, result_day5, result_day4, result_day3, result_day2, result_day1

All the data is stored in a JSON file as shown above. What I want is to tell the spider to check whether a fetched result already exists in the JSON file: if it does, skip it; if it does not, add it to the file.

Is that possible using Scrapy?

For example:

If yesterday's results (06.json) were

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
]

and today's results (07.json) are

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
]

then I want today's list (07.json) to end up as

[
  {"title": "ExampleTitle5", "author": "Username5", "score": "16700"}
]

after the filtering.

Answer

Scrapy really provides only one way to look for 'duplicates' (in the data, not duplicate requests): collecting data with items in an item pipeline and using a duplicates filter. See:

https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter

It drops items when a duplicate is detected. I have two problems with this approach: (1) you have to write the duplicate-filter method yourself to define what a duplicate is, based on the data you're working with, and (2) this method really only helps for checking duplicates within the same 'run' of the spider.
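A minimal sketch of such a pipeline, assuming the (title, author) pair is what identifies a topic (adjust the key to whatever makes an item unique in your data), might look like this:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:

    def __init__(self):
        # Keys seen so far; lives in memory, so it only covers a single run.
        self.seen = set()

    def process_item(self, item, spider):
        # The user items ({"name": ..., "karma": ...}) have no title; pass them through.
        if item.get('title') is None:
            return item
        key = (item['title'], item['author'])
        if key in self.seen:
            raise DropItem("Duplicate topic found: %r" % (key,))
        self.seen.add(key)
        return item

You would enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.DuplicatesPipeline': 300} (the module path is a placeholder). The in-memory set is exactly why this only catches duplicates within one run.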

An alternative, if you run the spider across several days, is to persist data between runs. See:

https://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

With this approach, your spider.state would hold the data from the last run (the previous day). When you run the spider again, you know what data you got from that run, so you can implement logic to pull only the data that is unique to the current day (timestamp the data for each day and use the last day as the comparison). This is quick to implement, and it might be good enough to solve your issue.
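As a rough sketch of that idea, assuming you start the spider with a job directory so that self.state actually survives between runs (for example scrapy crawl Reddit -s JOBDIR=crawls/reddit-weekly), and using a made-up 'seen_titles' key:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):
        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        # self.state is a plain dict that Scrapy pickles into JOBDIR between runs.
        seen = self.state.setdefault('seen_titles', set())

        title = response.css('a.title::text').extract_first()
        if title in seen:
            return  # already collected on a previous day, skip it
        seen.add(title)

        yield {
            "title": title,
            "author": response.css('p.tagline a.author::text').extract_first(),
        }

Note that the 'seen_titles' set accumulates across runs, which already hints at the growth problem described next.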

But this approach becomes unwieldy if you have to compare data across all the days before the current one, because the spider would then have to persist data for every day of the week prior to the current run. Your spider.state dictionary (which would just be the JSON results for each day) would get really large as it fills up with the data from all days before day 7, for example.

If you need the data added for the current day to be unique compared with all the days before it, I would ditch Scrapy's built-in mechanisms entirely. I would just write all the data to a database, with timestamps of when the data was scraped. You can then use database queries to find out what unique data was added on each individual day.
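For instance, a minimal sketch of an item pipeline that writes every topic to SQLite with a scrape timestamp (the database file, table, and column names here are invented for illustration):

import sqlite3
from datetime import datetime, timezone

class SqliteTopicsPipeline:

    def open_spider(self, spider):
        self.conn = sqlite3.connect('topics.db')
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS topics (
                title      TEXT,
                author     TEXT,
                score      INTEGER,
                scraped_at TEXT,
                UNIQUE (title, author)  -- one row per topic, ever
            )
        """)

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        if 'title' in item:  # ignore the user items in this sketch
            self.conn.execute(
                "INSERT OR IGNORE INTO topics VALUES (?, ?, ?, ?)",
                (item['title'], item['author'], int(item['score']),
                 datetime.now(timezone.utc).isoformat()),
            )
        return item

Because of the UNIQUE constraint and INSERT OR IGNORE, each topic is stored only once, with the timestamp of the day it first appeared, so a query such as SELECT * FROM topics WHERE scraped_at LIKE '2018-06-07%' returns exactly the items that were new on that day.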
