Can't get through a form with scrapy

Problem description

I'm new to scrapy and I'm trying to get some info from a real estate website. The site has a home page with a search form (GET method). I'm trying to go straight to the results page (recherche.php) in my start_requests, setting all the GET parameters I see in the address bar via the formdata parameter. I also set the cookies I had, but that didn't work either.

Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):
    name = "elyse_avenue"
    allowed_domains = ["http://www.elyseavenue.com/"]

    def start_requests(self):
        return [FormRequest(url="http://www.elyseavenue.com/recherche.php",
                            formdata={'recherche':'recherche',
                                      'compteurLigne':'2',
                                      'numLigneCourante':'0',
                                      'inseeVille_0':'',
                                      'num_rubrique':'',
                                      'rechercheOK':'recherche',
                                      'recherche_budget_max':'',
                                      'recherche_budget_min':'',
                                      'recherche_surface_max':'',
                                      'recherche_surface_min':'',
                                      'recherche_distance_km_0':'20',
                                      'recherche_reference_bien':'',
                                      'recherche_type_logement':'9',
                                      'recherche_ville_0':''
                                     },
                            cookies={'PHPSESSID':'4e1d729f68d3163bb110ad3e4cb8ffc3',
                                     '__utma':'150766562.159027263.1340725224.1340725224.1340727680.2',
                                     '__utmc':'150766562',
                                     '__utmz':'150766562.1340725224.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
                                     '__utmb':'150766562.14.10.1340727680'
                                    },
                            callback=self.parseAnnonces
                           )]



    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items


SPIDER = ElyseAvenueSpider()

When I run the spider there is no error, but the page that gets loaded is not the right one (it says "Please specify your search" and I don't get any results):

2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider opened
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-06-26 20:04:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-06-26 20:04:54+0200 [elyse_avenue] DEBUG: Crawled (200) <POST http://www.elyseavenue.com/recherche.php> (referer: None)
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Closing spider (finished)
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Dumping spider stats:
    {'downloader/request_bytes': 808,
     'downloader/request_count': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 7590,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 924624),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 6, 26, 18, 4, 54, 559230)}
2012-06-26 20:04:54+0200 [elyse_avenue] INFO: Spider closed (finished)
2012-06-26 20:04:54+0200 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 27410432, 'memusage/startup': 27410432}
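
One thing I notice in the log is that the request went out as a POST ("Crawled (200) <POST ...>"), even though the form on the site uses GET; as far as I understand, FormRequest defaults to POST when it is only given formdata. Here is a rough sketch of what I think the equivalent plain GET request would look like (the parameter names are just copied from my spider above and are not verified against the site):

import urllib  # on Python 3 this would be urllib.parse.urlencode

from scrapy.http import Request

# Only the non-empty parameters from the spider above.
search_params = {'rechercheOK': 'recherche',
                 'recherche_distance_km_0': '20',
                 'recherche_type_logement': '9'}

search_url = ("http://www.elyseavenue.com/recherche.php?"
              + urllib.urlencode(search_params))

# Inside start_requests() this would become:
#     return [Request(search_url, callback=self.parseAnnonces)]
request = Request(search_url)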

Thanks for your help!

Answer

I would use FormRequest.from_response(), which does all of the work for you, since otherwise you could still miss some fields:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request

from robots_immo.items import AnnonceItem

class ElyseAvenueSpider(BaseSpider):

    name = "elyse_avenue"
    allowed_domains = ["elyseavenue.com"] # I fixed this
    start_urls = ["http://www.elyseavenue.com/"] # I added this

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='moteurRecherche',
                                        formdata={'recherche_distance_km_0':'20',
                                                  'recherche_type_logement':'9'},
                                        callback=self.parseAnnonces)

    def parseAnnonces(self, response):
        hxs = HtmlXPathSelector(response)
        annonces = hxs.select('//div[@id="contenuCentre"]/div[@class="blocVignetteBien"]')
        items = []
        for annonce in annonces:
            item = AnnonceItem()
            item['nom'] = annonce.select('span[contains(@class,"nomBienImmo")]/a/text()').extract()
            item['superficie'] = annonce.select('table//tr[2]/td[2]/span/text()').extract()
            item['prix'] = annonce.select('span[@class="prixVignette"]/span[1]/text()').extract()
            items.append(item)
        return items
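
If you want to check what from_response() will actually submit before running the whole spider, you can try it in the Scrapy shell first. A rough sketch, assuming the form on the home page really is named 'moteurRecherche':

# Run:  scrapy shell "http://www.elyseavenue.com/"
# and then, inside the shell:
from scrapy.http import FormRequest

req = FormRequest.from_response(response,
                                formname='moteurRecherche',
                                formdata={'recherche_distance_km_0':'20',
                                          'recherche_type_logement':'9'})
print(req.method)  # taken from the form's method attribute
print(req.url)     # for a GET form the parameters end up in the query string
print(req.body)    # for a POST form the encoded parameters end up here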
