Extract data from a gsmarena page using scrapy

Question

I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".

However, the data is organized in tables and table rows. The data is of the format:

table header > td[@class='ttl'] > td[@class='nfo']
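To sanity-check those selectors before wiring up the full spider, something like the following can be run inside scrapy shell "http://www.gsmarena.com/htc_one_me-7275.php". This is a minimal sketch assuming the div#specs-list container and the ttl/nfo classes described above; verify it against the live markup.

# Rough sanity check of the selector structure (assumes the markup above):
#   div#specs-list > table > th       -> section name (e.g. "Display")
#   td.ttl                            -> attribute label
#   td.nfo (sibling of td.ttl)        -> attribute value
for table in response.css("div#specs-list table"):
    section = table.xpath(".//th/text()").extract_first()
    for ttl in table.xpath(".//td[@class='ttl']"):
        label = " ".join(ttl.xpath(".//text()").extract()).strip()
        value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract()).strip()
        print(section, "-", label, ":", value)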

Edited code: thanks to the help of community members on Stack Exchange, I've reformatted the code as follows.

items.py file:

import scrapy

class gsmArenaDataItem(scrapy.Item):
    # Name of the spec section (taken from the table header)
    phoneName = scrapy.Field()
    # Concatenated "label: value" pairs from the spec table rows
    phoneDetails = scrapy.Field()

Spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each spec section on the page is a separate table inside div#specs-list
        for table in response.css("div#specs-list table"):
            phone = gsmArenaDataItem()
            # The table header (<th>) holds the section name
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            details = []
            # Each td.ttl holds a label; its sibling td.nfo holds the value
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join all "label: value" pairs for this section into one string
            phone['phoneDetails'] = ", ".join(details)
            yield phone

However, I'm getting banned as soon as I try to even load the page in scrapy shell using:

"http://www.gsmarena.com/htc_one_me-7275.php"

I've even tried using DOWNLOAD_DELAY = 3 in settings.py.
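For reference, a download delay is normally combined with a few other politeness settings in settings.py. The sketch below shows one such combination; the values are illustrative examples, not the asker's actual configuration.

# settings.py -- example politeness settings (values are illustrative)
DOWNLOAD_DELAY = 3                   # wait 3 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay (0.5x to 1.5x) so requests look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # only one request in flight per domain
USER_AGENT = "Mozilla/5.0 (compatible; my-crawler)"  # send a browser-like user agent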

Please suggest what I should do.

Answer

I also faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling AutoThrottle helped significantly, but did not solve the problem completely.
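A minimal settings.py sketch of that combination is shown below. AutoThrottle is built into Scrapy; the scrapy-proxies middleware name, PROXY_LIST, and PROXY_MODE follow that project's README, so double-check them against the version you install. All values are illustrative.

# settings.py -- AutoThrottle plus rotating proxies (illustrative values)

# Built-in AutoThrottle: adapts the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Retry aggressively, since free proxies fail often
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# scrapy-proxies middleware (per its README; verify for your version)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/path/to/proxy/list.txt'   # one proxy per line, e.g. http://host:port
PROXY_MODE = 0                           # 0 = pick a random proxy for every request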

You can see how I did it in gsmarenacrawler.
