How can scrapy crawl more urls?
Problem description
As we can see:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []
    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)
    return items
Scrapy just gets one page response and finds the URLs in that response. I think this is only a surface crawl!!

But I want more URLs, crawled down to a defined depth.

What can I do to implement this??

Thanks!!
Answer
I did not understand your question, but I noticed several problems in your code, some of which could be related to your question (see the comments in the code):
sites = hxs.select('//ul/li')
items = []
for site in sites:
    item = Website()
    # This extracts a list, so I guess .extract()[0] is expected.
    item['name'] = site.select('a/text()').extract()
    # '//a[...]': maybe you expect this to get the links within `site`,
    # but it actually gets the links from the entire page; you should
    # use './/a[...]'. And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
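The answer above only flags the selector bugs; it does not show how to crawl beyond the start page. As a rough sketch of one common approach (not part of the original answer): fix the selectors as suggested, then yield each extracted link back into the same callback as a new Request, and cap the recursion with Scrapy's DEPTH_LIMIT setting. The spider name, start_urls, and the myproject.items import path below are placeholders for the asker's own project.

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

from myproject.items import Website  # hypothetical: the asker's item module


class DepthSpider(BaseSpider):
    name = 'depth_example'               # placeholder
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = Website()
            # .extract() returns a list; take the first match, if any
            item['name'] = (site.select('a/text()').extract() or [None])[0]
            # './/a[...]' searches within this <li>;
            # '//a[...]' would search the entire page
            item['url'] = (site.select('.//a[contains(@href, "http")]/@href').extract() or [None])[0]
            item['description'] = (site.select('text()').extract() or [None])[0]
            yield item
            # Follow the link so the crawl goes deeper than the start page.
            # Set DEPTH_LIMIT in settings.py (e.g. DEPTH_LIMIT = 3) to stop
            # the recursion at a defined depth.
            if item['url']:
                yield Request(item['url'], callback=self.parse)

An alternative in the same spirit is Scrapy's CrawlSpider with link-extraction rules, which follows links declaratively; DEPTH_LIMIT applies either way.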