How can scrapy crawl more urls?
Problem description
As we can see:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []
    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)
    return items
Scrapy just gets one page response and finds the URLs in that response. I think this is only a surface crawl!!

But I want more URLs, crawled down to a defined depth.

What can I do to implement this??

Thanks!!
Answer
I did not understand your question, but I noticed several problems in your code, some of which could be related to your question (see the comments in the code):
sites = hxs.select('//ul/li')
items = []
for site in sites:
    item = Website()
    # This extracts a list, so I guess .extract()[0] is expected.
    item['name'] = site.select('a/text()').extract()
    # '//a[...]': maybe you expect this to get the links within `site`,
    # but it actually gets the links from the entire page; you should
    # use './/a[...]'. And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
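The answer above only flags the selector bugs; it does not show how to crawl beyond the start page. As a rough sketch of one common approach (not part of the original answer): fix the selectors as suggested, then yield each extracted link back into the same callback as a new Request, and cap the recursion with Scrapy's DEPTH_LIMIT setting. The spider name, start_urls, and the myproject.items import path below are placeholders for the asker's own project.

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

from myproject.items import Website  # hypothetical: the asker's item module


class DepthSpider(BaseSpider):
    name = 'depth_example'               # placeholder
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = Website()
            # .extract() returns a list; take the first match, if any
            item['name'] = (site.select('a/text()').extract() or [None])[0]
            # './/a[...]' searches within this <li>;
            # '//a[...]' would search the entire page
            item['url'] = (site.select('.//a[contains(@href, "http")]/@href').extract() or [None])[0]
            item['description'] = (site.select('text()').extract() or [None])[0]
            yield item
            # Follow the link so the crawl goes deeper than the start page.
            # Set DEPTH_LIMIT in settings.py (e.g. DEPTH_LIMIT = 3) to stop
            # the recursion at a defined depth.
            if item['url']:
                yield Request(item['url'], callback=self.parse)

An alternative in the same spirit is Scrapy's CrawlSpider with link-extraction rules, which follows links declaratively; DEPTH_LIMIT applies either way.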