Extract data from a gsmarena page using scrapy
Question
I'm trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".
However, the data is organized into tables and table rows, in the following format:
table header > td[@class='ttl'] > td[@class='nfo']
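To make that layout concrete, here is a minimal sketch that pairs each `ttl` cell with its `nfo` cell. It uses the standard library's `xml.etree.ElementTree` rather than scrapy so it runs without a network request. The `SAMPLE` markup, its placeholder values, and the `parse_specs` helper are assumptions made for illustration; the real page is larger and may not be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup imitating one spec table (placeholder values).
SAMPLE = """
<div id="specs-list">
  <table>
    <tr><th rowspan="2">Body</th>
        <td class="ttl">Dimensions</td><td class="nfo">example value 1</td></tr>
    <tr><td class="ttl">Weight</td><td class="nfo">example value 2</td></tr>
  </table>
</div>
"""


def parse_specs(markup):
    """Return {section heading: {ttl text: nfo text}} for each table."""
    root = ET.fromstring(markup.strip())
    sections = {}
    for table in root.iter("table"):
        # The <th> cell holds the section heading.
        heading = table.findtext(".//th", default="").strip()
        rows = {}
        for tr in table.iter("tr"):
            ttl = tr.find("td[@class='ttl']")
            nfo = tr.find("td[@class='nfo']")
            if ttl is not None and nfo is not None:
                label = "".join(ttl.itertext()).strip()
                rows[label] = "".join(nfo.itertext()).strip()
        sections[heading] = rows
    return sections


specs = parse_specs(SAMPLE)
print(specs)
```

Keeping the pairs in a dictionary per section avoids losing rows when several `ttl`/`nfo` pairs share one table header.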
Edited code: Thanks to the help of community members at stackexchange, I've reformatted the code as follows. items.py file:
    import scrapy


    class gsmArenaDataItem(scrapy.Item):
        phoneName = scrapy.Field()
        phoneDetails = scrapy.Field()
Spider file:
    from scrapy import Spider
    from gsmarena_data.items import gsmArenaDataItem


    class testSpider(Spider):
        name = "mobile_test"
        allowed_domains = ["gsmarena.com"]
        start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

        def parse(self, response):
            # Each table under div#specs-list holds one spec section
            for table in response.css("div#specs-list table"):
                # Create a fresh item per table so earlier results are not overwritten
                phone = gsmArenaDataItem()
                # The <th> cell carries the section heading
                phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
                details = []
                for ttl in table.xpath(".//td[@class='ttl']"):
                    ttl_value = " ".join(ttl.xpath(".//text()").extract())
                    nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                    details.append(ttl_value + ": " + nfo_value)
                # Keep every "label: value" pair, not only the last one
                phone['phoneDetails'] = ", ".join(details)
                yield phone
However, I get banned as soon as I try to load the page in scrapy shell using:
"http://www.gsmarena.com/htc_one_me-7275.php"
I've even tried setting DOWNLOAD_DELAY = 3 in settings.py.
Please suggest what I should do.
Recommended answer
I also faced the same problem of getting banned within a few requests. Rotating proxies with scrapy-proxies and enabling autothrottling helped significantly, but did not solve the problem completely.
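As a rough sketch of what that combination could look like in settings.py: the AutoThrottle and retry names below are standard Scrapy settings, while the middleware order and the `PROXY_LIST` / `PROXY_MODE` names follow the scrapy-proxies README as I recall it, so check them against the version you install. All numeric values and the proxy file path are illustrative assumptions, not values tuned for gsmarena.

```python
# settings.py (sketch; values are illustrative assumptions)

# Fixed base delay between requests, as tried in the question
DOWNLOAD_DELAY = 3

# AutoThrottle adapts the delay to the latency the crawler observes
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Retry generously, since individual proxies tend to fail
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Run the random-proxy middleware between retry and the built-in proxy handler
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Hypothetical path to a text file with one proxy URL per line
PROXY_LIST = "/path/to/proxy/list.txt"
# Assumed from the scrapy-proxies README: 0 picks a random proxy per request
PROXY_MODE = 0
```

As the answer notes, throttling and proxy rotation reduce bans but do not guarantee the site will stop blocking the crawler.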
You can look at gsmarenacrawler