Get scrapy result inside a Django view


Problem description

I'm scraping a page successfully that returns a unique item. I don't want to save the scraped item to the database or to a file. I need to get it inside a Django view.

My view is the following:

def start_crawl(process_number, court):
    """
    Starts the crawler.

        Args:
            process_number (str): Process number to be found.
            court (str): Court of the process.
    """
    runner = CrawlerRunner(get_project_settings())
    results = list()

    def crawler_results(sender, parse_result, **kwargs):
        results.append(parse_result)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    process_info = runner.crawl(MySpider, process_number=process_number, court=court)

    return results

I followed this solution, but the results list is always empty.

I read something about creating a custom middleware and getting the results in the process_spider_output method.

How can I get the desired result?

Thanks!

Recommended answer

I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution. You might need to modify it or add multi-threading support, etc., if you put it in a production environment.

I created an ItemPipeline that just adds the items to an InMemoryItemStore helper. Then, in my __main__ code, I wait for the crawler to finish and pop all the items out of the InMemoryItemStore. From there I can manipulate the items as I wish.

items_store.py

Hacky in-memory store. It is not very elegant, but it got the job done for me. Modify and improve it if you wish. I've implemented it as a simple class object, so I can import it anywhere in the project and use it without passing an instance around.

class InMemoryItemStore(object):
    __ITEM_STORE = None

    @classmethod
    def pop_items(cls):
        # Return everything collected so far and reset the store.
        items = cls.__ITEM_STORE or []
        cls.__ITEM_STORE = None
        return items

    @classmethod
    def add_item(cls, item):
        # Lazily create the list on first use, then append the item.
        if not cls.__ITEM_STORE:
            cls.__ITEM_STORE = []
        cls.__ITEM_STORE.append(item)
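
For illustration, this is how the store behaves when used directly; the dict items here are just hypothetical placeholders, not part of the original answer.

# Hypothetical items, only to show the behaviour of the store.
InMemoryItemStore.add_item({"process_number": "123"})
InMemoryItemStore.add_item({"process_number": "456"})

print(InMemoryItemStore.pop_items())  # [{'process_number': '123'}, {'process_number': '456'}]
print(InMemoryItemStore.pop_items())  # [] -- popping also empties the store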

pipelines.py

This pipeline will store the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down to the other pipelines, simply change process_item so it does not return those items (a variant is sketched after the code below).

from <your-project>.items_store import InMemoryItemStore


class StoreInMemoryPipeline(object):
    """Add items to the in-memory item store."""
    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        return item
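
As a side note, if you do want to keep some items out of the later pipelines (as mentioned above), the idiomatic Scrapy way is to raise DropItem instead of returning the item. A minimal sketch of that variant, assuming a hypothetical skip_downstream flag on the item:

from scrapy.exceptions import DropItem

from <your-project>.items_store import InMemoryItemStore


class StoreInMemoryPipeline(object):
    """Store every item, but keep flagged items out of the later pipelines."""
    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        if item.get("skip_downstream"):  # hypothetical flag, adapt to your items
            raise DropItem("item kept in memory only")
        return item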

settings.py

Now add the StoreInMemoryPipeline to the scraper settings. If you change the process_item method above, make sure you set the proper priority here (adjust the 100 value below accordingly).

ITEM_PIPELINES = {
   ...
   '<your-project-name>.pipelines.StoreInMemoryPipeline': 100,
   ...
}

main.py

This is where I tie all these things together. I clear the in-memory store, run the crawler, and fetch all the items.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from <your-project>.items_store import InMemoryItemStore
from <your-project>.spiders.your_spider import YourSpider

def get_crawler_items(**kwargs):
    # Discard anything left over from a previous run.
    InMemoryItemStore.pop_items()

    process = CrawlerProcess(get_project_settings())
    process.crawl(YourSpider, **kwargs)
    process.start()  # the script will block here until the crawling is finished
    process.stop()
    return InMemoryItemStore.pop_items()

if __name__ == "__main__":
    items = get_crawler_items()
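
Since the question asks for the result inside a Django view, here is a minimal, hypothetical sketch of how the helper above could be called from one; the view name, URL parameters, and JSON conversion are assumptions, not something from the original answer. One caveat: CrawlerProcess.start() runs the Twisted reactor, which cannot be restarted within the same process, so this pattern only supports a single crawl per worker process.

# views.py -- illustrative only
from django.http import JsonResponse

from <your-project>.main import get_crawler_items  # assuming the snippet above is saved as main.py


def process_detail(request, process_number, court):
    items = get_crawler_items(process_number=process_number, court=court)
    return JsonResponse({"items": [dict(item) for item in items]})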

