Extracting Field Names of an HTML Form - Python

Question

Assume that there is a link "http://www.someHTMLPageWithTwoForms.com" which is basically an HTML page with two forms (say Form 1 and Form 2). I have code like this:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
# Parse only the <input> tags, then print the name of each one that has it
soup = BeautifulSoup(content, 'html.parser', parse_only=SoupStrainer('input'))
for field in soup.find_all('input'):
    if field.has_attr('name'):
        print(field['name'])

This returns all the field names that belong to both Form 1 and Form 2 of my HTML page. Is there any way to get only the field names that belong to a particular form (say, Form 2 only)?

Recommended answer

Doing this kind of parsing would also be quite easy using lxml (which I personally prefer over BeautifulSoup because of its XPath support). For example, the following snippet prints the names of all fields (if they have one) that belong to forms named "form2":

# you can ignore this part, it's only here for the demo
from io import StringIO
HTML = StringIO("""
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <input name="firstInput" type="text" />
        <input name="secondInput" type="text" />
    </form>
</body>
</html>
""")

# here goes the useful code
import lxml.html
tree = lxml.html.parse(HTML)  # parse() accepts a file-like object or a URL
root = tree.getroot()
for form in root.xpath('//form[@name="form2"]'):
    for field in form:  # iterate over the form's direct children
        if 'name' in field.keys():
            print(field.get('name'))
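
The loop above only looks at the form's direct children, so a field wrapped in a <div> or <p> inside the form would be skipped. One possible variation, shown here as a rough sketch rather than part of the original answer, lets XPath descend through the whole form and select only named controls. The helper name form_field_names, the SAMPLE markup, and the choice of input, select and textarea as "fields" are assumptions made for illustration.

```python
import lxml.html


def form_field_names(html_text, form_name):
    """Return the name attribute of every named input/select/textarea
    found anywhere inside the form(s) whose name equals form_name."""
    root = lxml.html.fromstring(html_text)
    # form_name is passed as an XPath variable, so no manual quoting is needed
    fields = root.xpath(
        '//form[@name=$form_name]'
        '//*[self::input or self::select or self::textarea][@name]',
        form_name=form_name,
    )
    return [field.get('name') for field in fields]


# Hypothetical markup: form2 nests its fields inside other elements
SAMPLE = """
<html><body>
  <form name="form1" action="/foo">
    <input name="uselessInput" type="text" />
  </form>
  <form name="form2" action="/bar">
    <div><input name="firstInput" type="text" /></div>
    <div><select name="choice"><option>a</option></select></div>
    <input type="submit" />
  </form>
</body></html>
"""

print(form_field_names(SAMPLE, "form2"))
```

The unnamed submit button is left out because of the [@name] predicate, and a form name that does not exist on the page simply yields an empty list.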
