Python Scrapy not always downloading data from website

Question

I am using Scrapy to parse an HTML page. Why does Scrapy sometimes return the response I want and sometimes return nothing? Is it my fault? Here's my spider with its parse function:

from scrapy.selector import Selector
from scrapy.spider import BaseSpider  # Scrapy 0.x name; newer releases use scrapy.Spider

# AmazonScrapyItem is the project's own Item subclass (its definition is not shown).


class AmazonSpider(BaseSpider):
    name = "amazon"
    allowed_domains = ["amazon.org"]
    start_urls = [
        "http://www.amazon.com/s?rh=n%3A283155%2Cp_n_feature_browse-bin%3A2656020011"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[contains(@class, "result")]')
        items = []
        titles = {'titles': sites[0].xpath('//a[@class="title"]/text()').extract()}
        for title in titles['titles']:
            item = AmazonScrapyItem()
            item['title'] = title
            items.append(item)
        return items

Recommended answer

I believe you are simply not using the most suitable XPath expression.

Amazon's HTML is somewhat messy and not very uniform, which makes it hard to parse. After some experimenting, though, I could extract all 12 titles from a couple of search result pages with the following parse function:

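def parse(self, response):
    sel = Selector(response)
    # Select the title link of every result: <div class="data"><h3><a>.
    p = sel.xpath('//div[@class="data"]/h3/a')
    # Collect text nested in a <span> first, then text placed directly in the link.
    titles = p.xpath('span/text()').extract() + p.xpath('text()').extract()
    items = []
    for title in titles:
        item = AmazonScrapyItem()
        item['title'] = title
        items.append(item)
    return items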

If you care about the actual order of the results, the above code might not be appropriate, because it lists every title found inside a span before any title found directly in the link text. I believe order does not matter in your case, though.
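
If order does matter, one option is to resolve each link's title on its own instead of concatenating two lists. The sketch below is only an illustration under assumptions: it uses the standard library's ElementTree on a small hand-written snippet (the SAMPLE markup, the book titles, and both function names are made up) so that it runs without Scrapy or a live Amazon page. Inside a spider, the same per-link logic would apply to each selector returned by the '//div[@class="data"]/h3/a' query.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a result page: some titles are
# wrapped in a <span>, others sit directly in the link text.
SAMPLE = """
<html><body>
  <div class="data"><h3><a href="/b1"><span>First Book</span></a></h3></div>
  <div class="data"><h3><a href="/b2">Second Book</a></h3></div>
  <div class="data"><h3><a href="/b3"><span>Third Book</span></a></h3></div>
</body></html>
"""

LINK_PATH = './/div[@class="data"]/h3/a'


def concatenated_titles(markup):
    """Mimic the answer's approach: all <span> titles first, then direct text."""
    links = ET.fromstring(markup.strip()).findall(LINK_PATH)
    from_spans = [a.find('span').text for a in links if a.find('span') is not None]
    direct = [a.text for a in links if a.text and a.text.strip()]
    return from_spans + direct


def ordered_titles(markup):
    """Resolve one title per link, so page order is preserved."""
    titles = []
    for link in ET.fromstring(markup.strip()).iterfind(LINK_PATH):
        span = link.find('span')
        # Prefer the nested <span> text; fall back to the link's own text.
        text = span.text if span is not None else link.text
        if text and text.strip():
            titles.append(text.strip())
    return titles


if __name__ == "__main__":
    print(concatenated_titles(SAMPLE))
    print(ordered_titles(SAMPLE))
```

On this sample the concatenating version moves "Second Book" to the end, while the per-link version keeps the titles in page order.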
