Python 3: filling out a form with requests (library) returns the same page's HTML without the input parameters

Views: 60 Published: 2021/5/14 20:18:27 Tags: python html selenium python-requests mechanize


Question

I am trying to use requests to fill out a form on https://www.doleta.gov/tradeact/taa/taa_search_form.cfm, get the HTML of the new page that this opens, and extract information from that page.

Here is the relevant HTML:

  <form action="taa_search.cfm" method="post" name="number_search" id="number_search" onsubmit="return validate(this);">
    <label for="input">Petition number</label>
    :
    <input name="input" type="text" size="7" maxlength="7" id="input">
    <input type="hidden" name="form_name" value="number_search" />
    <input type=submit value="Get TAA information" />
  </form>

Here is the Python code I am trying to use.

import requests

url = 'https://www.doleta.gov/tradeact/taa/taa_search.cfm'
payload = {'number_search':'11111'}
r = requests.get(url, params=payload)
with open("requests_results1.html", "wb") as f:
    f.write(r.content)

When you perform the query manually, this page opens: https://www.doleta.gov/tradeact/taa/taa_search.cfm.

However, when I use the above Python code, it returns the HTML of https://www.doleta.gov/tradeact/taa/taa_search_form.cfm (the first page), unchanged.

I cannot run similar code against https://www.doleta.gov/tradeact/taa/taa_search.cfm because it redirects to the first URL, so running the code returns the HTML of the first URL.

Because of the permissions setup on my computer, I cannot change the PATH on my PC (which means Selenium is off the table) and I cannot install Python 2 (which means mechanize is off the table). I am open to using urllib but do not know the library very well.

I need to perform this action ~10,000 times to scrape the information. I can build the iteration part myself, but I cannot figure out how to get the base function to work properly.

Recommended answer

The first observation is that your example code sends a GET request, while the form specifies a POST request.

<form action="taa_search.cfm" method="post" ...>
                              ^^^^^^^^^^^^^
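To illustrate the difference, here is a minimal sketch that uses only the standard library's urllib (which the question mentions as an option) to build, but not send, a GET and a POST request carrying the same fields. The field names come from the name attributes of the form's input elements shown in the question. The question's payload used number_search as a key, which is the form's name rather than a field name. The value 11111 is the question's placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = 'https://www.doleta.gov/tradeact/taa/taa_search.cfm'

# Keys must match the <input name="..."> attributes in the form.
fields = {'form_name': 'number_search', 'input': '11111'}
encoded = urlencode(fields)

# GET: the fields travel in the URL's query string and there is no body.
get_request = Request(SEARCH_URL + '?' + encoded)

# POST: the URL stays clean and the url-encoded fields travel in the body.
post_request = Request(SEARCH_URL, data=encoded.encode('ascii'))

# Nothing is sent over the network; we only inspect the request objects.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

A form declared with method="post" expects the second shape, which is what requests.post(url, data=...) produces.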

After switching to a POST request, I was still getting the same result as you (the HTML of the main search form page). After some experimentation, I was able to get the proper HTML results by adding a Referer header to the request.

Here is the code (the part that writes to a file is commented out for the purposes of this example):
import requests

BASE_URL = 'https://www.doleta.gov/tradeact/taa'


def get_case_decision(case_number):
    headers = {
        'referer': '{}/taa_search_form.cfm'.format(BASE_URL)
    }
    payload = {
        'form_name': 'number_search',
        'input': case_number
    }
    r = requests.post(
        '{}/taa_search.cfm'.format(BASE_URL),
        data=payload,
        headers=headers
    )
    r.raise_for_status()
    return r.text
    # with open('requests_results_{}.html'.format(case_number), 'wb') as f:
    #     f.write(r.content)

Testing:
>>> result = get_case_decision(10000)
>>> 'MODINE MFG. COMPANY' in result
True
>>> '9/12/1980' in result
True
>>> result = get_case_decision(10001)
>>> 'MUSKIN CORPORATION' in result
True
>>> '2/27/1981' in result
True
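When looping over many case numbers, it may help to detect the failure mode described in the question, where the response is the search form again. The following is a minimal sketch, assuming the site answers with an HTTP redirect, so that the final URL of the response (the url attribute of a requests response) ends in taa_search_form.cfm. The helper landed_on_search_form is hypothetical, and the stand-in object below only mimics that one attribute. If the server returns the form's HTML without redirecting, this check would not catch it.

```python
from types import SimpleNamespace

FORM_PAGE = 'taa_search_form.cfm'


def landed_on_search_form(response):
    """Return True if the response's final URL is the search form page.

    Hypothetical helper: assumes `response.url` holds the URL reached
    after any redirects, as it does for a requests response.
    """
    # Ignore any query string before comparing the path's ending.
    return response.url.split('?', 1)[0].endswith(FORM_PAGE)


# Stand-in object for illustration only; in real use pass the object
# returned by requests.post(...).
demo = SimpleNamespace(
    url='https://www.doleta.gov/tradeact/taa/taa_search_form.cfm'
)
print(landed_on_search_form(demo))
```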

Since you mentioned that you need to perform this ~10,000 times, you will probably want to look into using requests.Session as well.
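As a rough sketch of what that could look like: a Session reuses the underlying connection across requests and can carry the Referer header on every call. The names make_session and fetch_many, the 30-second timeout, and the one-second pause between requests are assumptions added for illustration and are not part of the original answer.

```python
import time

import requests

BASE_URL = 'https://www.doleta.gov/tradeact/taa'


def make_session():
    """Create a Session that sends the Referer header on every request."""
    session = requests.Session()
    session.headers.update(
        {'referer': '{}/taa_search_form.cfm'.format(BASE_URL)}
    )
    return session


def get_case_decision(session, case_number):
    """POST the search form for one case number and return the HTML."""
    payload = {'form_name': 'number_search', 'input': case_number}
    r = session.post(
        '{}/taa_search.cfm'.format(BASE_URL),
        data=payload,
        timeout=30,  # assumed value; avoids hanging forever on one request
    )
    r.raise_for_status()
    return r.text


def fetch_many(case_numbers, delay_seconds=1.0):
    """Fetch several cases over one Session, pausing between requests."""
    results = {}
    with make_session() as session:
        for case_number in case_numbers:
            results[case_number] = get_case_decision(session, case_number)
            time.sleep(delay_seconds)  # assumed pause, to go easy on the server
    return results
```

For example, fetch_many(range(10000, 10003)) would return a dict mapping each case number to the HTML of its result page.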
