How can scrapy crawl more urls?


Question

As we can see:

# Imports assumed from the rest of the spider (old-style Scrapy API);
# Website is the project's Item class, and its module path here is hypothetical.
from scrapy.selector import HtmlXPathSelector
from myproject.items import Website

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

Scrapy just gets one page response and finds the URLs in that response. I think this is only a surface crawl!

But I want to follow more URLs, down to a defined depth.

What can I do to implement this?

Thanks!

Answer

I didn't fully understand your question, but I noticed several problems in your code, some of which could be related to it (see the comments in the code):

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # This extracts a list, so I guess .extract()[0] is expected.
    item['name'] = site.select('a/text()').extract()
    # '//a[...]': maybe you expect this to get the links within `site`,
    # but it actually gets the links from the entire page; you should use
    # './/a[...]'. And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
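
As for the actual question of crawling more than one page: Scrapy follows links whenever a callback returns (or yields) Request objects, and the built-in DEPTH_LIMIT setting caps how many levels deep the crawl goes. Below is a minimal sketch in the same old-style Scrapy API the question uses; the //ul/li structure and the Website item class are assumptions carried over from the question:

import urlparse  # Python 2 stdlib, matching the old Scrapy API used here

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    # Extract the items on the current page (with the fixes noted above).
    for site in hxs.select('//ul/li'):
        item = Website()  # assumed: the project's Item class from the question
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('.//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        yield item

    # Follow every link on the page. Each Request is scheduled and its
    # response is fed back into parse(), so the crawl goes deeper than
    # the first page instead of stopping at the "surface".
    for href in hxs.select('//a/@href').extract():
        yield Request(urlparse.urljoin(response.url, href), callback=self.parse)

To stop at a defined depth, set DEPTH_LIMIT (e.g. DEPTH_LIMIT = 2) in the project's settings.py; Scrapy's built-in duplicate filter also keeps the same URL from being crawled twice.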

