Scrapy item loader to get an absolute URL from an extracted URL

Views: 34. Published: 2021/7/16 22:14:14. Tags: python, web-scraping, scrapy


Question

I am learning Scrapy, the Python framework, to scrape a few web pages I am interested in. Along the way I extract the links in a page, but in most cases those links are relative. I used urljoin_rfc from scrapy.utils.url to get the absolute URL, and it worked fine.

While learning I came across a feature called Item Loaders, and now I want to do the same thing with an Item Loader. My urljoin_rfc() call lives in a user-defined function, _urljoin(url, response), and I want my loader to use that function. So in my loader class I wrote link_in = _urljoin(), and I changed the declaration to _urljoin(url, response = loader_context.response). But I get an error saying NameError: name 'loader_context' is not defined

I need help here. I am doing it this way because _urljoin() is called not only while loading but also from other parts of my code. If I am doing something badly wrong, please point it out.

Recommended answer

If you're using _urljoin(url, response) elsewhere, you can keep it as it is, accepting a response as its second argument.
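As for the NameError itself: Python evaluates default argument values once, when the def statement runs, not when the function is called. At that moment no variable named loader_context exists, so the definition fails before any loader is involved. A minimal sketch that reproduces this with plain Python (the function body is left empty on purpose, and the try/except is only there to capture the message):

```python
# Default values are evaluated when the function is defined, so the name
# loader_context is looked up immediately and is not found.
try:
    def _urljoin(url, response=loader_context.response):
        ...
except NameError as exc:
    error = str(exc)

print(error)
```

This is why the response has to arrive as a real argument at call time rather than as a default in the signature.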

Now, processors for Item Loaders can accept a context, but the context is a dict of arbitrary key/values which is shared among all input and output processors (from the docs).

So you could have a wrapper function that calls your _urljoin(url, response):

def urljoin_w_context(url, loader_context):
    # A processor that declares a loader_context argument receives the loader's context dict
    response = loader_context.get('response')
    return _urljoin(url, response)

And in your ItemLoader definition:

    ...
    link_in = MapCompose(urljoin_w_context)
    ...

And finally, in your callback code, when you instantiate your ItemLoader, pass the response reference:

def parse_something(self, response):
    ...
    loader = ItemLoader(item, response=response)
    ...
