Web scraper for dynamic forms in Python

Question

I am trying to fill in the form on this website: http://www.marutisuzuki.com/Maruti-Price.aspx.

It consists of three drop-down lists: the first is the car model, the second is the state, and the third is the city. The first two are static. The third, city, is generated dynamically from the selected state: a JavaScript onclick event fetches the cities that belong to that state.

I am familiar with the mechanize module in Python. I came across several links saying that mechanize cannot handle dynamic content. However, the section "Adding item dynamically" of http://toddhayton.com/2014/12/08/form-handling-with-mechanize-and-beautifulsoup/ states that mechanize can be used to handle dynamic content, but I did not understand this line of code in it:

item = Item(br.form.find_control(name='searchAuxCountryID'),{'contents': '3', 'value': '3', 'label': 3})

What is "Item" in this line of code, as it would apply to the city field in the form? I also came across the selenium module, which might help me handle the dynamic drop-down list, but I could not find anything in its documentation, or any good blog post, on how to use it.

Can someone suggest how to submit this form for different models, states and cities? Any links on how to solve this problem would be appreciated, and sample Python code showing how to submit the form would be helpful. Thanks in advance.

Recommended answer

If you look at the requests being sent to that site in your browser's developer tools, you'll see that a POST is sent as soon as you select a state. The response that comes back contains the form with the values in the city dropdown populated.
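
To see what "the city dropdown populated" means in practice, here is a minimal offline sketch that pulls the options out of a response body using only the standard library. The HTML fragment, the option values (DL, ND, DW), the __VIEWSTATE value and the SelectOptionParser class are all made up for illustration; only the two control names are taken from the script in this answer, and the real page is much larger.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down response body (an assumption, not the real page).
SAMPLE_RESPONSE = """
<form id="form1" method="post" action="Maruti-Price.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <select name="ctl00$ContentPlaceHolder1$ddlState">
    <option value="0">Select State</option>
    <option value="DL" selected="selected">Delhi</option>
  </select>
  <select name="ctl00$ContentPlaceHolder1$ddlCity">
    <option value="0">Select City</option>
    <option value="ND">New Delhi</option>
    <option value="DW">Dwarka</option>
  </select>
</form>
"""


class SelectOptionParser(HTMLParser):
    """Collect [value, label] pairs for every <select>, keyed by its name."""

    def __init__(self):
        super().__init__()
        self.options = {}       # select name -> list of [value, label]
        self._select = None     # name of the <select> currently open
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select':
            self._select = attrs.get('name')
            self.options[self._select] = []
        elif tag == 'option' and self._select is not None:
            self.options[self._select].append([attrs.get('value', ''), ''])
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == 'option':
            self._in_option = False
        elif tag == 'select':
            self._select = None

    def handle_data(self, data):
        # Only text inside an <option> is a label; ignore everything else.
        if self._in_option:
            self.options[self._select][-1][1] += data.strip()


parser = SelectOptionParser()
parser.feed(SAMPLE_RESPONSE)

# Skip the first entry, which is the "Select City" placeholder.
cities = parser.options['ctl00$ContentPlaceHolder1$ddlCity'][1:]
print(cities)
```

With mechanize none of this parsing is needed, since find_control(...).get_items() exposes the same information; the sketch only shows what the server sends back after the state is posted.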

So, to replicate this in your script, you want something like the following:

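  • Open the page
  • Select the form
  • Select values for model and state
  • Submit the form
  • Select the form from the response that is sent back
  • Select a value for city (it should be populated now)
  • Submit the form
  • Parse the response for the results table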

This would look like:

#!/usr/bin/env python

import re
import mechanize

from bs4 import BeautifulSoup

def select_form(form):
    return form.attrs.get('id', None) == 'form1'

def get_state_items(browser):
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlState')
    state_items = ctl.get_items()
    return state_items[1:]

def get_city_items(browser):
    browser.select_form(predicate=select_form)
    ctl = browser.form.find_control('ctl00$ContentPlaceHolder1$ddlCity')
    city_items = ctl.get_items()
    return city_items[1:]

br = mechanize.Browser()
br.open('http://www.marutisuzuki.com/Maruti-Price.aspx')
br.select_form(predicate=select_form)
br.form['ctl00$ContentPlaceHolder1$ddlmodel'] = ['AK']  # model = Maruti Suzuki Alto K10

for state in get_state_items(br):
    # 1 - Submit form for state.name to get cities for this state
    br.select_form(predicate=select_form)
    br.form['ctl00$ContentPlaceHolder1$ddlState'] = [state.name]
    br.submit()

    # 2 - Now the city dropdown is filled for state.name
    for city in get_city_items(br):
        br.select_form(predicate=select_form)
        br.form['ctl00$ContentPlaceHolder1$ddlCity'] = [city.name]
        br.submit()

        # 3 - Parse the dealer table out of the response for this city
        s = BeautifulSoup(br.response().read(), 'html.parser')
        t = s.find('table', id='ContentPlaceHolder1_dtDealer')
        if t is None:
            continue  # no dealer table in the response for this city
        r = re.compile(r'^ContentPlaceHolder1_dtDealer_lblName_\d+$')

        header_printed = False
        for p in t.find_all('span', id=r):
            tr = p.find_parent('tr')
            td = tr.find_all('td')

            if not header_printed:
                # Print "City, State" once, underlined, before the first row
                header = '%s, %s' % (city.attrs['label'], state.attrs['label'])
                print(header)
                print('-' * len(header))
                header_printed = True

            print(' '.join(x.text.strip() for x in td))
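
The regular expression in the script is what restricts the loop to one span per dealer row. A small sketch of how it filters ids follows; the pattern is copied from the script, while the sample ids are hypothetical examples of the kind of ids such a page might contain, not values taken from the real site.

```python
import re

# Same pattern as in the script: matches only the per-row dealer-name spans.
DEALER_NAME_ID = re.compile(r'^ContentPlaceHolder1_dtDealer_lblName_\d+$')

# Hypothetical ids, made up for illustration.
sample_ids = [
    'ContentPlaceHolder1_dtDealer_lblName_0',
    'ContentPlaceHolder1_dtDealer_lblName_12',
    'ContentPlaceHolder1_dtDealer_lblAddress_0',     # different field
    'ContentPlaceHolder1_dtDealer_lblName_',         # no row index
    'ctl00_ContentPlaceHolder1_dtDealer_lblName_3',  # extra prefix
]

# Keep only ids that match the whole pattern from start (^) to end ($).
matching = [i for i in sample_ids if DEALER_NAME_ID.match(i)]
print(matching)
```

Because the pattern is anchored at both ends, each table row contributes exactly one match, and the script then walks up to the enclosing tr to read all of that row's cells.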
