Python 3: filling out a form with requests (library) returns the same page's HTML without submitting the parameters


Question

I am trying to use requests to fill out a form on https://www.doleta.gov/tradeact/taa/taa_search_form.cfm, retrieve the HTML of the page that the form opens, and extract information from that page.

Here is the relevant HTML:

  <form action="taa_search.cfm" method="post" name="number_search" id="number_search" onsubmit="return validate(this);">
    <label for="input">Petition number</label>
    :
    <input name="input" type="text" size="7" maxlength="7" id="input">
    <input type="hidden" name="form_name" value="number_search" />
    <input type=submit value="Get TAA information" />
  </form>

Here is the Python code I am trying to use:

import requests

url = 'https://www.doleta.gov/tradeact/taa/taa_search.cfm'
payload = {'number_search':'11111'}
r = requests.get(url, params=payload)
with open("requests_results1.html", "wb") as f:
    f.write(r.content)

When you perform the query manually, this page opens: https://www.doleta.gov/tradeact/taa/taa_search.cfm.

However, when I use the Python code above, it returns the HTML of https://www.doleta.gov/tradeact/taa/taa_search_form.cfm (the first page), with nothing changed.

I cannot run similar code against https://www.doleta.gov/tradeact/taa/taa_search.cfm directly, because it redirects to the first URL, so running the code just returns the HTML of the first URL.

Because of the permissions setup on my computer, I cannot change my PC's PATH (which means Selenium is off the table) and I cannot install Python 2 (which means mechanize is off the table). I am open to using urllib but do not know the library very well.

I need to perform this action ~10,000 times to scrape the information. I can build the iteration part myself, but I cannot figure out how to get the base function to work properly.

Recommended answer

The first observation is that your example code sends a GET request, whereas the form specifies a POST request:

<form action="taa_search.cfm" method="post" ...>
                              ^^^^^^^^^^^^^
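
To make the difference concrete, here is a minimal sketch that builds both requests with requests.Request(...).prepare() and inspects them without sending anything over the network. The name SEARCH_URL and the value '11111' are only illustrative. It shows that params= puts the values in the URL's query string, while data= form-encodes them into the request body, which is what a method="post" form submits.

```python
import requests

# Illustrative constant: the form's action URL taken from the question.
SEARCH_URL = 'https://www.doleta.gov/tradeact/taa/taa_search.cfm'

# GET with params=...: the values are appended to the URL as a query string.
get_request = requests.Request(
    'GET', SEARCH_URL, params={'number_search': '11111'}
).prepare()

# POST with data=...: the values are form-encoded into the request body.
# The keys are the names of the form's <input> elements.
post_request = requests.Request(
    'POST', SEARCH_URL, data={'form_name': 'number_search', 'input': '11111'}
).prepare()

# Nothing has been sent; these are just the prepared requests.
print(get_request.method, get_request.url, get_request.body)
print(post_request.method, post_request.url, post_request.body)
print(post_request.headers['Content-Type'])
```

Note that the POST payload uses the field names from the HTML (input and form_name), not the form's own name (number_search), which is what the question's payload used as its key.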

After changing to a POST request, I was still getting the same result as you, though (the HTML of the main search form page). After a bit of experimentation, I seem to be able to get the proper HTML results by adding a Referer header to the request.

Here is the code (the part that writes to a file is commented out for example purposes). Note that the payload keys are the form's field names, input and form_name, rather than the form's own name, number_search:

import requests

BASE_URL = 'https://www.doleta.gov/tradeact/taa'


def get_case_decision(case_number):
    headers = {
        'referer': '{}/taa_search_form.cfm'.format(BASE_URL)
    }
    payload = {
        'form_name': 'number_search',
        'input': case_number
    }
    r = requests.post(
        '{}/taa_search.cfm'.format(BASE_URL),
        data=payload,
        headers=headers
    )
    r.raise_for_status()
    return r.text
    # with open('requests_results_{}.html'.format(case_number), 'wb') as f:
    #     f.write(r.content)

Testing:

>>> result = get_case_decision(10000)
>>> 'MODINE MFG. COMPANY' in result
True
>>> '9/12/1980' in result
True
>>> result = get_case_decision(10001)
>>> 'MUSKIN CORPORATION' in result
True
>>> '2/27/1981' in result
True

Since you mentioned that you need to perform this ~10,000 times, you will probably want to look into using requests.Session as well, which reuses the underlying connection across requests.
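
As a rough sketch of what that could look like, the version below sets the Referer header once on a requests.Session and reuses the session for every lookup. The helper names make_session and scrape_cases, the 30-second timeout, and the one-second pause between requests are assumptions added for illustration, not part of the original answer. Adjust them to whatever the site tolerates.

```python
import time

import requests

BASE_URL = 'https://www.doleta.gov/tradeact/taa'


def make_session():
    """Create a session that sends the Referer header on every request."""
    session = requests.Session()
    session.headers.update(
        {'referer': '{}/taa_search_form.cfm'.format(BASE_URL)}
    )
    return session


def get_case_decision(session, case_number):
    """POST the search form for one petition number and return the HTML."""
    payload = {'form_name': 'number_search', 'input': case_number}
    r = session.post(
        '{}/taa_search.cfm'.format(BASE_URL),
        data=payload,
        timeout=30,  # assumed value, so a stalled request cannot hang the run
    )
    r.raise_for_status()
    return r.text


def scrape_cases(case_numbers, delay_seconds=1.0):
    """Fetch each case in turn, pausing between requests (assumed delay)."""
    results = {}
    with make_session() as session:
        for case_number in case_numbers:
            results[case_number] = get_case_decision(session, case_number)
            time.sleep(delay_seconds)
    return results
```

Calling scrape_cases(range(10000, 10003)) would return a dict mapping each petition number to the HTML of its result page. For a long run you may also want to catch requests.RequestException per case so that one failure does not abort the whole loop.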
