Scrapy and submitting a JavaScript form

Question

I'm learning Scrapy and I've run into a snag trying to submit a form that is controlled by JavaScript.

I've tried a number of things found here on Stack Overflow, including Selenium, but have had no luck (for a number of reasons).

The page I need to scrape is... http://agmarknet.nic.in/

...and do a commodities search. When I inspect elements, it appears to have a form "m" with a field "cmm" that needs a commodity value.

<form name="m" method="post">
(...)
<input type="text" name="cmm" onchange="return validateName(document.m.cmm.value);" size="13">
(...)
<input type="button" value="Go" name="Go3" style="color: #000080; font-size: 8pt; font-family: Arial; font-weight: bold" onclick="search1();"></td>

Any advice gratefully accepted!

UPDATE: I've tried this with Selenium, but it doesn't find or populate the field. I also wouldn't mind being able to do this without popping up a Firefox window...

    CrawlSpider.__init__(self)
    self.verificationErrors = []

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    driver.get("http://agmarknet.nic.in/")
    time.sleep(4)
    elem = driver.find_element_by_name("cmm")
    elem.send_keys("banana")
    time.sleep(5)
    elem.send_keys(Keys.RETURN)
    driver.close()        

UPDATE:

I've also tried various iterations of the following, but with no luck. When I submit the search from the web page, Fiddler2 tells me it is POSTing the string "cmm=banana&mkt=&search="... but when I use the code below, Fiddler tells me nothing is being posted...

class Agmarknet(Spider):
    name = "agmarknet"
    start_urls = ["http://agmarknet.nic.in/SearchCmmMkt.asp"]


    def parse(self, response):
        return [FormRequest.from_response(
                    response,
                   #formname = "cmm1", 
                    formdata={
                    'method':'post',
                    'cmm': 'banana', 
                    'mkt': '', 
                    'search':''},
                    callback=self.after_search)]

    def after_search(self):
        print response.body

OUTPUT FROM ABOVE:

{'download_timeout': 180, 'download_latency': 13.44700002670288, 'proxy': 'http://127.0.0.1:8888', 'download_slot': 'agmarknet.nic.in'}
Spider error processing <GET http://agmarknet.nic.in/SearchCmmMkt.asp>
Traceback (most recent call last):
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\twisted\internet\defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "Z:\WindowsDocuments\eclipseworkspaces\BioCom\manoliagro-agmarknetscraper\src\bin\agmarknetscraper\spiders\agmarknet.py", line 34, in parse
    callback=self.after_search)]
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\scrapy-0.22.0-py2.7.egg\scrapy\http\request\form.py", line 36, in from_response
    form = _get_form(response, formname, formnumber, formxpath)
  File "Z:\WinPython-32bit-2.7.6.2\python-2.7.6\lib\site-packages\scrapy-0.22.0-py2.7.egg\scrapy\http\request\form.py", line 59, in _get_form
    raise ValueError("No <form> element found in %s" % response)
exceptions.ValueError: No <form> element found in <200 http://agmarknet.nic.in/SearchCmmMkt.asp>
SpiderRun done

Solution

The page consists of two frames; a short glance at the source reveals their names, 'contents' and 'main'. So your script above nearly does the job. It is only missing a single line that switches to the right frame, 'main', with driver.switch_to_frame('main'). Also, the form does not react to the ENTER key, so we do have to select the button and click it :-).

This code works:
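To make the "glance at the source" step concrete, here is a minimal sketch that lists frame names from a page's HTML using only the Python 3 standard library. The sample markup is hypothetical: only the two frame names come from this answer, while the src values and layout are made up for illustration. FrameLister and list_frames are illustrative names, not part of Scrapy or Selenium.

```python
from html.parser import HTMLParser


class FrameLister(HTMLParser):
    """Collect (name, src) for every <frame> or <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frames = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag in ("frame", "iframe"):
            attr = dict(attrs)
            self.frames.append((attr.get("name"), attr.get("src")))


def list_frames(html):
    """Return the (name, src) pairs of all frames found in html."""
    parser = FrameLister()
    parser.feed(html)
    return parser.frames


# Hypothetical markup: the src values are placeholders, not the real site's.
SAMPLE = """
<frameset cols="20%,80%">
  <frame name="contents" src="menu.html">
  <frame name="main" src="search.html">
</frameset>
"""

print(list_frames(SAMPLE))
# → [('contents', 'menu.html'), ('main', 'search.html')]
```

If the form you are after is not in the top-level document, a listing like this tells you which frame name to switch to before looking up the field.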

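import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://agmarknet.nic.in/")
time.sleep(4)  # crude wait for the frameset to load

# The search form lives in the frame named 'main'.
# (Older Selenium releases spell this driver.switch_to_frame('main').)
driver.switch_to.frame('main')
textinput = driver.find_element(By.NAME, 'cmm')
textinput.send_keys("banana")
time.sleep(1)

# The form ignores ENTER, so click the "Go" button instead.
button = driver.find_element(By.NAME, "Go3")
button.click()
driver.close()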
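As a side note on the Scrapy attempt in the question: the traceback shows FormRequest.from_response failing because the fetched response contained no <form> element, so no request was ever built. Also, the 'method': 'post' entry in formdata is sent as an ordinary form field; it does not set the HTTP method. The sketch below, in Python 3 with only the standard library, builds a POST request whose body matches the string Fiddler reported. The target URL is a placeholder assumption, because the question does not show where search1() submits the form, and build_search_request is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder assumption: the real submit URL is not given in the question.
SEARCH_URL = "http://example.invalid/search"


def build_search_request(commodity, market="", url=SEARCH_URL):
    """Build (but do not send) a form-encoded POST for a commodity search."""
    # Field names and order follow the body Fiddler showed in the question.
    fields = {"cmm": commodity, "mkt": market, "search": ""}
    body = urlencode(fields)
    return Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_search_request("banana")
print(req.get_method(), req.data.decode("ascii"))
# → POST cmm=banana&mkt=&search=
```

Nothing is sent here; the request object is only constructed so its method and body can be compared with what the browser posts. Whether the server accepts such a request without the cookies or referrer of a real browser session is not something the question or answer establishes.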
