PYTHON SCRAPY Can't POST information to FORMS


Problem description

I am going to ask a very big favor, as I have been struggling with this problem for several days. I have tried every way I know of and still have no result. I am doing something wrong, but I can't figure out what. So thank you to everyone who is willing to come along on this adventure.

First things first: I am trying to use the POST method to submit the search form on delta.com. As always with websites like this, it is complicated: they rely on sessions, cookies and JavaScript, so the problem could lie there. I am using a code example that I found on Stack Overflow: Using MultipartPostHandler to POST form-data with Python. Here is my code, tweaked for the Delta web page.

from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from delta.items import DeltaItem
from scrapy.contrib.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    name = "delta"
    allowed_domains = ["http://www.delta.com"]
    start_urls = ["http://www.delta.com"]

    def start_requests(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        url="http://www.delta.com/booking/findFlights.do",
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013',
                                                  'paxCount': '1'},
                                        callback=self.parse1)

    def parse1(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//')
        items = []
        for site in sites:
            item = DeltaItem()
            item['title'] = site.select('text()').extract()
            item['link'] = site.select('text()').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

When I instruct the spider to crawl, I see the following in the terminal:

 scrapy crawl delta -o items.xml  -t xml

2013-07-01 13:39:30+0300 [scrapy] INFO: Scrapy 0.16.2 started (bot: delta)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-01 13:39:30+0300 [delta] INFO: Spider opened
2013-07-01 13:39:30+0300 [delta] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:39:30+0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:39:33+0300 [delta] DEBUG: Crawled (200) <GET http://www.delta.com> (referer: None)
2013-07-01 13:39:33+0300 [delta] INFO: Closing spider (finished)
2013-07-01 13:39:33+0300 [delta] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 219,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 27842,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 7, 1, 10, 39, 33, 159235),
     'log_count/DEBUG': 7,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 7, 1, 10, 39, 30, 734090)}
2013-07-01 13:39:33+0300 [delta] INFO: Spider closed (finished)

If you compare this with the example from the link, I don't see that I ever managed to make a POST request, even though I am using almost the same code. I even tried a very simple HTML/PHP form from W3Schools that I placed on a server, but got the same result there: whatever I did, I never managed to create a POST request. I think the problem is simple, but the only Python I know is Scrapy, and everything I know about Scrapy comes from the documentation (it is well documented) and from examples, and that is still not enough for me. So if anyone could at least show me the right way, it would be a very big help.

Answer

Here's a working example of using FormRequest.from_response for delta.com:

from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class DeltaItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozSpider(BaseSpider):
    name = "delta"
    allowed_domains = ["delta.com"]
    start_urls = ["http://www.delta.com"]

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formname='flightSearchForm',
                                        formdata={'departureCity[0]': 'JFK',
                                                  'destinationCity[0]': 'SFO',
                                                  'departureDate[0]': '07.20.2013',
                                                  'departureDate[1]': '07.28.2013'},
                                        callback=self.parse1)

    def parse1(self, response):
        print response.status

You've used the wrong spider methods, and allowed_domains was set incorrectly.
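The signature mix-up can be seen without Scrapy at all. These stub classes are just a sketch, not real spiders, but they mimic how Scrapy invokes start_requests(): with no arguments, before any page has been downloaded, which is why FormRequest.from_response() has no response to work from there.

```python
class BrokenSpider:
    # Extra `response` parameter: Scrapy calls start_requests() with no
    # arguments, so this signature cannot match the framework's call.
    def start_requests(self, response):
        return []


class FixedSpider:
    # Matches what Scrapy actually calls. Anything that needs a response
    # belongs in a callback such as parse(self, response) instead.
    def start_requests(self):
        return []


# Calling them the way Scrapy does:
try:
    BrokenSpider().start_requests()
    broken_ok = True
except TypeError:
    broken_ok = False          # missing positional argument 'response'

fixed_ok = FixedSpider().start_requests() == []
```

That is why, in the working example above, the form is submitted from parse(), the callback that receives the downloaded start page.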

But, anyway, delta.com heavily uses dynamic AJAX calls to load its content, and this is where your problems start. For example, the response in the parse1 method doesn't contain any search results; instead, it contains the HTML of the "AWAY WE GO. ARRIVING AT YOUR FLIGHTS SOON" interstitial page, from which the results are loaded dynamically.

Basically, you should work with your browser's developer tools and try to simulate those AJAX calls inside your spider, or use a tool like Selenium, which drives a real browser (and can be combined with Scrapy).
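As a rough illustration of what replaying such a call by hand involves, here is a standard-library-only sketch of building the same form POST yourself. The endpoint path is an assumption copied from the question's code, and the field names come from the form above; take the real URL, fields and headers from the Network tab of your browser's developer tools.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Field names as they appear in the flightSearchForm above.
form_fields = {
    'departureCity[0]': 'JFK',
    'destinationCity[0]': 'SFO',
    'departureDate[0]': '07.20.2013',
    'departureDate[1]': '07.28.2013',
    'paxCount': '1',
}

# urlencode() percent-encodes the bracketed names, exactly as a browser
# (or Scrapy's FormRequest) would serialize the form body.
body = urlencode(form_fields).encode('ascii')

req = Request(
    'http://www.delta.com/booking/findFlights.do',  # hypothetical endpoint
    data=body,
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
)
# A urllib Request with a body defaults to the POST method.
```

Inspecting the request this produces (its method and body) against what the developer tools show the browser sending is a quick way to spot which fields or headers your spider is missing.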

See also:

  • Scraping ajax pages using python
  • Can scrapy be used to scrape dynamic content from websites that are using AJAX?
  • Pagination using scrapy

Hope that helps.
