Extract data from a gsmarena page using scrapy

Question

I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".

However, the data is organized in tables and table rows. The data is of the format:

table header > td[@class='ttl'] > td[@class='nfo']
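To sanity-check those selectors before wiring up the full spider, something like the following can be run inside scrapy shell "http://www.gsmarena.com/htc_one_me-7275.php". This is a minimal sketch assuming the div#specs-list container and the ttl/nfo classes described above; verify it against the live markup.

# Rough sanity check of the selector structure (assumes the markup above):
#   div#specs-list > table > th       -> section name (e.g. "Display")
#   td.ttl                            -> attribute label
#   td.nfo (sibling of td.ttl)        -> attribute value
for table in response.css("div#specs-list table"):
    section = table.xpath(".//th/text()").extract_first()
    for ttl in table.xpath(".//td[@class='ttl']"):
        label = " ".join(ttl.xpath(".//text()").extract()).strip()
        value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract()).strip()
        print(section, "-", label, ":", value)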

Edited code: thanks to the help of community members on Stack Exchange, I've reformatted the code as follows.

items.py file:

import scrapy

class gsmArenaDataItem(scrapy.Item):
    # Name of the spec section (taken from the table header)
    phoneName = scrapy.Field()
    # Concatenated "label: value" pairs from the spec table rows
    phoneDetails = scrapy.Field()

Spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each spec section on the page is a separate table inside div#specs-list
        for table in response.css("div#specs-list table"):
            phone = gsmArenaDataItem()
            # The table header (<th>) holds the section name
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            details = []
            # Each td.ttl holds a label; its sibling td.nfo holds the value
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join all "label: value" pairs for this section into one string
            phone['phoneDetails'] = ", ".join(details)
            yield phone

However, I'm getting banned as soon as I try to even load the page in scrapy shell using:

"http://www.gsmarena.com/htc_one_me-7275.php"

I've even tried using DOWNLOAD_DELAY = 3 in settings.py.
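For reference, a download delay is normally combined with a few other politeness settings in settings.py. The sketch below shows one such combination; the values are illustrative examples, not the asker's actual configuration.

# settings.py -- example politeness settings (values are illustrative)
DOWNLOAD_DELAY = 3                   # wait 3 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay (0.5x to 1.5x) so requests look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # only one request in flight per domain
USER_AGENT = "Mozilla/5.0 (compatible; my-crawler)"  # send a browser-like user agent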

Please suggest what I should do.

Answer

I also faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling AutoThrottle helped significantly, but did not solve the problem completely.
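A minimal settings.py sketch of that combination is shown below. AutoThrottle is built into Scrapy; the scrapy-proxies middleware name, PROXY_LIST, and PROXY_MODE follow that project's README, so double-check them against the version you install. All values are illustrative.

# settings.py -- AutoThrottle plus rotating proxies (illustrative values)

# Built-in AutoThrottle: adapts the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Retry aggressively, since free proxies fail often
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# scrapy-proxies middleware (per its README; verify for your version)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
PROXY_LIST = '/path/to/proxy/list.txt'   # one proxy per line, e.g. http://host:port
PROXY_MODE = 0                           # 0 = pick a random proxy for every request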

You can see how I did it in gsmarenacrawler.
