Scrapy CrawlSpider + Splash: how to follow links through linkextractor?


Problem description

I have the following code that is partially working:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=("//a[contains(text(), 'Next Page')]")
        ),
            callback='parse_item',
            process_request='start_requests',
            follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

The code runs only for the start_urls but does not follow the links matched in restrict_xpaths. If I comment out the start_requests() method and the process_request='start_requests' line in the rule, it runs and follows the links as intended, though of course without JS rendering.

I have read the two related questions, CrawlSpider with Splash getting stuck after first URL and CrawlSpider with Splash, and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks.

Recommended answer

I've had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it to work was to not use the scrapy-splash plugin, and instead use the 'process_links' method to prepend the Splash HTTP API URL to all of the links Scrapy collects. Then I made other adjustments to compensate for the new issues that arise from this approach. Here's what I did:

You'll need these two tools to put the Splash URL together and then take it apart again if you intend to store it somewhere.

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy would filter them all out as offsite domain requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']

However, this poses a problem, because we may then end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor rule: by only extracting links from our desired domain, we get around the offsite request problem.

rules = (
    Rule(
        LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
        process_links='process_links',
    ),
)

Here's the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

def process_links(self, links):
    for link in links:
        # Rewrite each extracted link into a Splash render.html request,
        # unless it has already been rewritten.
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url': link.url,
                                                                          'wait': 2.0})
    return links

Finally, to take the original URL back out of the Splash URL, use the parse_qs method.

parse_qs(response.url)['url'][0] 
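
For example, here is a minimal sketch of how that might look inside a parse callback; the callback name and the item field are hypothetical, not part of the original answer:

def parse_item(self, response):
    # response.url is the Splash render.html URL; recover the page's
    # real URL from its query string before storing it.
    original_url = parse_qs(response.url)['url'][0]
    yield {'url': original_url}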

One final note about this approach. You'll notice that I have an '&' in the Splash URL right at the beginning (...render.html?&). This makes parsing the Splash URL to take out the actual URL consistent no matter what order the arguments come out of the urlencode method.
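
As a quick standalone sketch (standard library only) of why that helps: the leading '&' keeps the base URL in its own '&'-separated chunk, so parse_qs finds the 'url' key regardless of the order in which urlencode emits the arguments:

from urllib.parse import urlencode, parse_qs

base = "http://localhost:8050/render.html?&"

# The same parameters in two different orders; both parse back identically.
for params in ({'url': 'https://www.example.com/page', 'wait': 2.0},
               {'wait': 2.0, 'url': 'https://www.example.com/page'}):
    splash_url = base + urlencode(params)
    print(parse_qs(splash_url)['url'][0])  # https://www.example.com/page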
